2.2 Main software of interest
FastQC - Sequence data quality control
SortMeRNA - rRNA filtering
Trimmomatic - Low quality read and adapter trimming
STAR - Read mapping
HTSeq and Kallisto - Feature summarization
Just a few notes about the present tutorial. Most text will be formatted as this paragraph is, but at times there will be:
linux command lines:
command-line-tool -option value -option value argument argument
R code without output:
a <- 1
R code with output:
(a <- 1)
## [1] 1
Finally, as this is a tutorial, you will be subjected to exercises. You should only proceed after answering the question asked. To answer the question, go to _http://socrative.com_, select “Student Login” and use the room number you were provided with. You will be taken directly to the active quiz, which you can answer at your own pace; however, you can answer a question only once. Once you have submitted your answer, you can press “Close” in the present tutorial; the current section of the tutorial will collapse (you can always re-open it, e.g. if you need clarifications about the question) and the next section will open. If anything is unclear at any time, do not hesitate to contact us on Slack.
This tutorial addresses several aspects of an RNA-Seq data analysis. First, it deals with data pre-processing (a process rather agnostic to the type of NGS analysis, i.e. it would also be valid for ChIP-Seq or DNA-Seq). That part of the tutorial introduces the use of the command line (specifically the Bourne-again shell: bash) and several software tools to prepare sequencing data for downstream analysis. Specifically, we will be covering quality control, rRNA sorting, quality trimming, genome mapping and feature summarization. If these terms don’t mean much yet, hopefully they will by the end of today. Next, it deals with the manipulation of alignment files (in BAM format) and with obtaining and manipulating genomic (and genic) annotations, subsequently combining the two to obtain a raw count table1. This count table is the minimal input required for a Differential Expression analysis, which is the final step in this tutorial.
The goals of the tutorial are:
FastQC - Sequence data quality control
SortMeRNA - rRNA filtering
Trimmomatic - Low quality read and adapter trimming
STAR - Read mapping
HTSeq and Kallisto - Feature summarization
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Among these, we will focus principally on a few: GenomicAlignments, GenomicFeatures and GenomicRanges. We will look at these packages (i) as an example to understand what the process of feature summarization is (i.e. taking a look under the hood) and (ii) as an introduction to computational optimisation in R.
We will, in addition, use two packages for Differential Expression data analysis: tximport and DESeq2.
Recent technological developments introduced high-throughput sequencing approaches. A variety of experimental protocols and analysis workflows address gene expression, regulation, and encoding of genetic variants. Experimental protocols produce a large number (tens to hundreds of millions per sample) of short (e.g.: 35-250, single or paired-end) nucleotide sequences. These are aligned to a reference or other genome or transcriptome. Analysis workflows use the alignments to infer levels of gene expression (RNA-seq), binding of regulatory elements to genomic locations (ChIP-seq), or prevalence of structural variants (e.g.: SNPs, short indels, large-scale genomic rearrangements). Sample sizes range from minimal replication (e.g.: 2 samples per treatment group) to thousands of individuals.
RNA-Seq (RNA-Sequencing) has become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi and others 2008). RNA-Seq has reached rapid maturity in data handling, QC (Quality Control) and downstream statistical analysis methods, benefiting substantially from the extensive body of literature developed on the analysis of microarray technologies and their application to measuring gene expression. Although analysis of RNA-Seq remains more challenging than for microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, with the exception of commercial software options such as the CLCbio CLC Genomics Workbench, we are not aware of any fully integrated open-source pipelines for performing these pre-processing steps. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all the properties of the data are yet fully understood. Hence, the steps and available software tools that could be used in such a pipeline have changed rapidly in recent years, and it is only recently that it has become possible to propose a de-facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, there remain a number of caveats to be kept in mind in order to produce biologically and statistically sound results.
A pipeline, which we have developed for non-model plant organisms but which is generally applicable, can be seen in the figure below; a detailed description of it has been published2.
The pre-processing pipeline used during the course
As a running example, we use a dataset derived from a study performed in P. tremula (K. Robinson et al. 2014). The authors test the evolutionary theory suggesting that males and females may evolve sexually dimorphic phenotypic and biochemical traits, concordant with each sex having different optimal strategies of resource investment to maximise reproductive success and fitness. Such sexual dimorphism should result in sex-biased gene expression patterns in non-floral organs for autosomal genes associated with the control and development of such phenotypic traits. Hence, the authors, among other approaches, have looked for gene expression differences between the sexes. This was achieved by an RNA-Seq differential expression analysis between samples grouped by sex. The samples were sequenced on an Illumina HiSeq 2000, using TruSeq adapters, through a 101-cycle paired-end protocol, which yielded between 11 million and 34 million reads per sample.
In the following sections, we look at a subset of the data, corresponding to reads obtained from individuals of the RNA-Seq experiment and aligned to the P. trichocarpa reference genome. The reason why the reads were aligned to the P. trichocarpa genome is that the P. tremula genome is currently being established. P. trichocarpa is a closely related species (P. trichocarpa occurs in north-western America, while P. tremula grows in northern and central Europe).
As a side note, reads were retrieved from the European Nucleotide Archive (ENA, accession ID ERP002471), and were aligned to the P. trichocarpa reference genome using the STAR (Dobin et al. 2013) aligner.
There are numerous tools to support developing programs and software in R. For this course, we have selected one of them: the RStudio environment, which provides a feature-rich, user-friendly, cross-platform environment for working with R.
This chapter3 introduces R and Bioconductor in detail. During the course, we will only skim through it, as you are expected to be familiar with most of the concepts presented here. If that were not the case - and even if it were - consider it homework.
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor started more than 10 years ago and gained credibility for its statistically rigorous approach to microarray pre-processing and analysis of designed experiments, and integrative and reproducible approaches to bioinformatic tasks. There are now more than 800 Bioconductor packages for expression and other microarrays, sequence analysis, flow cytometry, imaging, and other domains. The Bioconductor web site provides installation, package repository, help, and other documentation.
The Bioconductor web site is at bioconductor.org. Features include:
Introductory workflows.
A manifest of Bioconductor packages arranged in BiocViews.
Annotation (data bases of relevant genomic information, e.g.: Entrez gene ids in model organisms, KEGG pathways) and experiment data (containing relatively comprehensive data sets and their analysis) packages.
Mailing lists, including searchable archives, as the primary source of help.
Course and conference information, including extensive reference material.
General information about the project.
Package developer resources, including guidelines for creating and submitting new packages.
You will have realized that all the links above are useful ones (and one is a shameless self-citation).
Many academic and commercial software products are available; why would one use R and Bioconductor? One answer comes from considering the demands that high-throughput genomic data place on effective computational biology software.
High-throughput questions make use of large data sets. This applies both to the primary data (microarray expression values, sequenced reads, etc.) and also to the annotations on those data (coordinates of genes and features such as exons or regulatory regions; participation in biological pathways, etc.). Large data sets place demands on our tools that preclude some standard approaches, such as spreadsheets. Likewise, intricate relationships between data and annotation, and the diversity of research questions, require the flexibility typical of a programming language rather than of a narrowly-enabled graphical user interface.
Analysis of high-throughput data is necessarily statistical. The volume of data requires that it be appropriately summarized before any sort of comprehension is possible. The data are produced by advanced technologies, and these introduce artifacts (e.g.: probe-specific bias in microarrays; sequence or base calling bias in RNA-seq experiments) that need to be accommodated to avoid incorrect or inefficient inference. Data sets typically derive from designed experiments, requiring a statistical approach both to account for the design and to correctly address the large number of observed values (e.g.: gene expression or sequence tag counts) and small number of samples accessible in typical experiments.
Research needs to be reproducible. Reproducibility is both an ideal of the scientific method, and a pragmatic requirement. The latter comes from the long-term and multi-participant nature of contemporary science. An analysis will be performed for the initial experiment, revisited again during manuscript preparation, and revisited during reviews or in determining next steps. Likewise, analyses typically involve a team of individuals with diverse domains of expertise. Effective collaborations result when it is easy to reproduce, perhaps with minor modifications, an existing result, and when sophisticated statistical or bioinformatic analysis can be effectively conveyed to other group members.
Science moves very quickly. This is driven by the novel questions that are the hallmark of discovery, and by technological innovation and accessibility. Rapidity of scientific development places significant burdens on software, which must also move quickly. Effective software cannot be too polished, because that requires that the correct analyses are ‘known’ and that significant resources of time and money have been invested in developing the software; this implies software that is tracking the trailing edge of innovation. On the other hand, leading-edge software cannot be too idiosyncratic; it must be usable by a wider audience than the creator of the software, and fit in with other software relevant to the analysis.
Effective software must be accessible. Affordability is one aspect of accessibility. Another is transparent implementation, where the novel software is sufficiently documented and the source code accessible enough for the assumptions, approaches, practical implementation decisions, and inevitable coding errors to be assessed by other skilled practitioners. A final aspect of accessibility is that the software is actually usable. This is achieved through adequate documentation, support forums, and training opportunities.
What features of R and Bioconductor contribute to its effectiveness as a software tool?
Bioconductor is well suited to handle extensive data and annotation. Bioconductor ‘classes’ represent high-throughput data and their annotation in an integrated way. Bioconductor methods use advanced programming techniques or R resources (such as transparent data base or network access) to minimize memory requirements and integrate with diverse resources. Classes and methods coordinate complicated data sets with extensive annotation. Nonetheless, the basic model for object manipulation in R involves vectorized in-memory representations. For this reason, particular programming paradigms (e.g.: block processing of data streams; explicit parallelism) or hardware resources (e.g.: large-memory computers) are sometimes required when dealing with extensive data.
R is ideally suited to addressing the statistical challenges of high-throughput data. Three examples include the development of the ‘RMA’ and other normalization algorithms for microarray pre-processing (affy), the use of moderated \(t\)-statistics for assessing microarray differential expression (limma), and the development of negative binomial approaches to estimating the dispersion of read counts, necessary for the appropriate analysis of designed RNA-Seq experiments (Anders and Huber 2010; M. D. Robinson, McCarthy, and Smyth 2010).
Many of the ‘old school’ aspects of R and Bioconductor facilitate reproducible research. An analysis is often represented as a text-based script. Reproducing the analysis involves re-running the script; adjusting how the analysis is performed involves simple text-editing tasks. Beyond this, R has the notion of a ‘vignette’, which represents an analysis as a LaTeX document with embedded R commands. The R commands are evaluated when the document is built, thus reproducing the analysis. The use of LaTeX, and more recently of simpler markdown languages in conjunction with pandoc to generate HTML formatted reports or vignettes, means that the symbolic manipulations in the script are augmented with textual explanations and justifications for the approach taken; these include graphical and tabular summaries at appropriate places in the analysis. R includes facilities for reporting the exact version of R and associated packages used in an analysis so that, if needed, discrepancies between software versions can be tracked down and their importance evaluated. While users often think of R packages as providing new functionality, packages are also used to enhance reproducibility by encapsulating a single analysis. The package can contain data sets, vignette(s) describing the analysis, R functions that might have been written, scripts for key data processing stages, and documentation (via standard R help mechanisms) of what the functions, data, and packages are about.
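For instance, the version-reporting facilities mentioned above are available in any R session (a quick sketch; the exact output depends on your installation):

```r
sessionInfo()            # R version, platform, attached and loaded packages
R.version.string         # just the R version, as a character string
packageVersion("stats")  # the version of one specific installed package
```

Recording this information at the end of an analysis script or vignette is what allows version discrepancies to be tracked down later.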
The Bioconductor project adopts practices that facilitate reproducibility. Versions of R and Bioconductor are released once and twice each year, respectively. Each Bioconductor release is the result of development, in a separate branch, during the previous six months. The release is built daily against the corresponding version of R on Linux, Mac, and Windows platforms, with an extensive suite of tests performed. The biocLite function ensures that each release of R uses the corresponding Bioconductor packages. The user thus has access to stable and tested package versions. R and Bioconductor are effective tools for reproducible research.
R and Bioconductor exist on the leading portion of the software life cycle. Contributors are primarily from academic institutions, and are directly involved in novel research activities. New developments are made available in a familiar format, i.e.: the R language, packaging, and build systems. The rich set of facilities in R - e.g.: for advanced statistical analysis or visualization - and the extensive resources in Bioconductor - e.g.: for annotation using third-party data such as Biomart (www.biomart.org, Kasprzyk 2011) or UCSC genome browser tracks (http://genome.ucsc.edu/, Meyer et al. 2013) - mean that innovations can be directly incorporated into existing workflows. The ‘development’ branches of R and Bioconductor provide an environment where contributors can explore new approaches without alienating their user base.
R and Bioconductor also fare well in terms of accessibility. The software is freely available. The source code is easily and fully accessible for critical evaluation. The R packaging and check system requires that all functions are documented. Bioconductor requires that each package contain vignettes to illustrate the use of the software. There are very active R and Bioconductor mailing lists for immediate support, and regular training and conference activities for professional development.
The table below enumerates many of the packages available for sequence analysis. The table includes packages for representing sequence-related data (e.g.: GenomicRanges, Biostrings), as well as domain-specific analysis such as RNA-seq (e.g.: edgeR, DEXSeq), ChIP-seq (e.g.: ChIPpeakAnno, DiffBind), and SNPs and copy number variation (e.g.: genoset, GGtools, VariantAnnotation).
| Concept | Packages |
|---|---|
| Data representation | IRanges, GenomicRanges, GenomicFeatures, Biostrings, BSgenome, girafe, genomeIntervals |
| Input / output | ShortRead (fastq), Rsamtools (bam), rtracklayer (gff, wig, bed), VariantAnnotation (vcf), R453Plus1Toolbox (454) |
| Annotation | biomaRt (Durinck et al. 2005), GenomicFeatures, ChIPpeakAnno, VariantAnnotation |
| Alignment | Biostrings, gmapR, Rsubread, Rbowtie |
| Visualization | ggbio, Gviz |
| Quality assessment | qrqc, seqbias, ReQON, htSeqTools, TEQC, Rolexa, ShortRead |
| RNA-seq | BitSeq, cqn, cummeRbund, DESeq (Anders and Huber 2010), DESeq2 (Love, Huber, and Anders 2014), DEXSeq (Anders, Reyes, and Huber 2012), EDASeq, edgeR (M. D. Robinson, McCarthy, and Smyth 2010), gage, goseq, iASeq, tweeDEseq |
| ChIP-seq | BayesPeak, baySeq, ChIPpeakAnno, chipseq, ChIPseqR, ChIPsim, CSAR, DiffBind, MEDIPS, mosaics, NarrowPeaks, nucleR, PICS, PING, REDseq, Repitools, TSSi |
| Motifs | BCRANK, cosmo, cosmoGUI, MotIV, seqLogo, rGADEM |
| 3C | HiTC, r3Cseq |
| Copy number | cn.mops, CNAnorm, exomeCopy, segmentSeq |
| Microbiome | phyloseq, DirichletMultinomial (Holmes 2012), clstutils, manta, mcaGUI |
| Workflows | ArrayExpressHTS, Genominator, easyRNASeq, oneChannelGUI, QuasR, rnaSeqMap (Leśniewska and Okoniewski 2011) |
| Database | SRAdb |
R is an open-source statistical programming language. It is used to manipulate data, to perform statistical analysis, and to present graphical and other results. R consists of a core language, additional ‘packages’ distributed with the R language, and a very large number of packages contributed by the broader community. Packages add specific functionality to an R installation. R has become the primary language of academic statistical analysis, and is widely used in diverse areas of research, government, and industry.
R has several unique features. It has a surprisingly ‘old school’ interface: users type commands into a console; scripts in plain text represent workflows; tools other than R are used for editing and other tasks. R is a flexible programming language, so while one person might use functions provided by R to accomplish advanced analytic tasks, another might implement their own functions for novel data types. As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy on change’ and ‘pass by value’ semantics, reducing unexpected consequences for users at the expense of less efficient memory use; common paradigms in other languages, such as the ‘for’ loop, are encountered much less commonly in R. Many authors contribute to R, so there can be a frustrating inconsistency of documentation and interface. R grew up in the academic community, so authors have not shied away from trying new approaches. Common statistical analysis functions are very well-developed.
Opening an R session results in a prompt. The user types instructions at the prompt. Here is an example:
## assign values 5, 4, 3, 2, 1 to variable 'x'
x <- c(5, 4, 3, 2, 1)
x
## [1] 5 4 3 2 1
The first line starts with a # to represent a comment; the line is ignored by R. The next line creates a variable x. The variable is assigned (using <-; we could have used = almost interchangeably) a value. The value assigned is the result of a call to the c function. That it is a function call is indicated by the function name followed by parentheses, c(). The c function takes zero or more arguments, and returns a vector. The vector is the value assigned to x. R responds to this line with a new prompt, ready for the next input. The next line asks R to display the value of the variable x. R responds by printing [1] to indicate that the subsequent number is the first element of the vector. It then prints the value of x.
R has many features to aid common operations. Entering sequences is a very common operation, and expressions of the form 2:4 create a sequence from 2 to 4. Sub-setting one vector by another is enabled with [. Here we create an integer sequence from 2 to 4, and use the sequence as an index to select the second, third, and fourth elements of x.
x[2:4]
## [1] 4 3 2
Index values can be repeated, and if outside the domain of x return the special value NA. Negative index values remove elements from the vector. Logical and character vectors (described below) can also be used for sub-setting.
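These behaviors can be sketched as follows, continuing with the vector x created above:

```r
x <- c(5, 4, 3, 2, 1)  # as defined above
x[c(2, 2)]    # index values can be repeated
## [1] 4 4
x[10]         # outside the domain of 'x': the special value NA
## [1] NA
x[-1]         # a negative index removes the first element
## [1] 4 3 2 1
x[x > 3]      # logical sub-setting: keep elements for which the condition holds
## [1] 5 4
```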
R functions operate on variables. Functions are usually vectorized, acting on all elements of their argument and obviating the need for explicit iteration. Functions can generate warnings when performing suspect operations, or errors if evaluation cannot proceed; try log(-1).
log(x)
## [1] 1.61 1.39 1.10 0.69 0.00
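As a sketch of the warning and error behavior mentioned above:

```r
log(-1)       # a suspect operation: returns NaN, with a warning ('NaNs produced')
## [1] NaN
res <- try(log("a"), silent=TRUE)  # an error: log() cannot handle character input
class(res)    # 'try' captured the error, and evaluation continued
## [1] "try-error"
```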
R has a number of standard data types, to represent integer, numeric (floating point), complex, character, logical (Boolean), and raw (byte) data. It is possible to convert between data types, and to discover the type or mode of a variable.
c(1.1, 1.2, 1.3) # numeric
## [1] 1.1 1.2 1.3
c(FALSE, TRUE, FALSE) # logical
## [1] FALSE TRUE FALSE
c("foo", "bar", "baz") # character, single or double quote ok
## [1] "foo" "bar" "baz"
as.character(x) # convert 'x' to character
## [1] "5" "4" "3" "2" "1"
typeof(x) # the number 5 is numeric, not integer
## [1] "double"
typeof(2L) # append 'L' to force integer
## [1] "integer"
typeof(2:4) # ':' produces a sequence of integers
## [1] "integer"
R includes data types particularly useful for statistical analysis, including factor to represent categories and NA (used in any vector) to represent missing values.
sex <- factor(c("Male", "Female", NA), levels=c("Female", "Male"))
sex
## [1] Male Female <NA>
## Levels: Female Male
All of the vectors mentioned so far are homogeneous, consisting of a single type of element. A list can contain a collection of different types of elements and, like all vectors, these elements can be named to create a key-value association.
lst <- list(a=1:3, b=c("foo", "bar"), c=sex)
lst
## $a
## [1] 1 2 3
##
## $b
## [1] "foo" "bar"
##
## $c
## [1] Male Female <NA>
## Levels: Female Male
Lists can be subset like other vectors to get another list, or subset with [[ to retrieve the actual list element; as with other vectors, sub-setting can use names.
lst[c(3, 1)] # another list
## $c
## [1] Male Female <NA>
## Levels: Female Male
##
## $a
## [1] 1 2 3
lst[["a"]] # the element itself, selected by name
## [1] 1 2 3
A data.frame is a list of equal-length vectors, representing a rectangular data structure not unlike a spreadsheet. Each column of the data frame is a vector, so data types must be homogeneous within a column. A data.frame can be subset by row or column using [,], and columns can be accessed with $ or [[. In the [,] notation, the row subset is given before the comma and the column subset afterwards, i.e.: df[rows, columns]. Either subset can be left empty.
df <- data.frame(age=c(27L, 32L, 19L),
sex=factor(c("Male", "Female", "Male")))
df
## age sex
## 1 27 Male
## 2 32 Female
## 3 19 Male
df[c(1, 3),]
## age sex
## 1 27 Male
## 3 19 Male
df[df$age > 20,]
## age sex
## 1 27 Male
## 2 32 Female
A matrix is also a rectangular data structure, but subject to the constraint that all elements are the same type. A matrix is created by taking a vector, and specifying the number of rows or columns the vector is to represent. On sub-setting, R coerces a single column data.frame or single row or column matrix to a vector if possible; use drop=FALSE to stop this behavior.
m <- matrix(1:12, nrow=3)
m
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
m[c(1, 3), c(2, 4)]
## [,1] [,2]
## [1,] 4 10
## [2,] 6 12
m[, 3]
## [1] 7 8 9
m[, 3, drop=FALSE]
## [,1]
## [1,] 7
## [2,] 8
## [3,] 9
An array is a data structure for representing homogeneous, rectangular data in higher dimensions.
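A minimal illustration, filling a 2 x 3 x 2 array with the integers 1 to 12; sub-setting uses one index per dimension:

```r
a <- array(1:12, dim=c(2, 3, 2))  # 2 rows, 3 columns, 2 'slices'
dim(a)
## [1] 2 3 2
a[1, 2, 2]  # row 1, column 2 of the second slice
## [1] 9
a[, , 1]    # the first slice, a 2 x 3 matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```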
More complicated data structures are represented using the ‘S3’ or ‘S4’ object system. Objects are often created by functions (for example, lm, below), with parts of the object extracted or assigned using accessor functions. The following generates 1000 random normal deviates as x, and uses these to create another 1000 deviates y that are linearly related to x but with some error. We fit a linear regression using a ‘formula’ to describe the relationship between variables, summarize the results in a familiar ANOVA table, and access fit (an S3 object) for the residuals of the regression, using these as input first to the var (variance) and then sqrt (square-root) functions. Objects can be interrogated for their class.
x <- rnorm(1000, sd=1)
y <- x + rnorm(1000, sd=.5)
fit <- lm(y ~ x) # formula describes linear regression
fit # an 'S3' object
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.0184 0.9860
anova(fit)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 988 988 3738 <2e-16 ***
## Residuals 998 264 0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sqrt(var(resid(fit))) # residuals accessor and subsequent transforms
## [1] 0.51
class(fit)
## [1] "lm"
Many Bioconductor packages implement S4 objects to represent data. S3 and S4 systems are quite different from a programmer’s perspective, but fairly similar from a user’s perspective: both systems encapsulate complicated data structures, and allow for methods specialized to different data types; accessors are used to extract information from the objects.
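As an illustration of that S4 machinery (using a hypothetical 'Sample' class invented for this sketch, not one from Bioconductor), a class, a generic and a method together provide an accessor:

```r
library(methods)  # the S4 system; loaded by default in most sessions

## define a small S4 class with two slots
setClass("Sample", representation(id="character", counts="numeric"))

## a generic and a method acting together as an accessor
setGeneric("counts", function(object) standardGeneric("counts"))
setMethod("counts", "Sample", function(object) object@counts)

s <- new("Sample", id="sample1", counts=c(10, 0, 25))
counts(s)  # the accessor hides the '@' slot access from the user
## [1] 10  0 25
class(s)
## [1] "Sample"
```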
R functions accept arguments, and return values. Arguments can be required or optional. Some functions may take variable numbers of arguments, e.g.: the columns in a data.frame.
y <- 5:1
log(y)
## [1] 1.61 1.39 1.10 0.69 0.00
args(log) # arguments 'x' and 'base'; see ?log
## function (x, base = exp(1))
## NULL
log(y, base=2) # 'base' is optional, with default value
## [1] 2.3 2.0 1.6 1.0 0.0
try(log()) # 'x' required; 'try' continues even on error
args(data.frame) # ... represents variable number of arguments
## function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
## fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
## NULL
Arguments can be matched by name or position. If an argument appears after ..., it must be named.
log(base=2, y) # match argument 'base' by name, 'x' by position
## [1] 2.3 2.0 1.6 1.0 0.0
A function such as anova is a generic that provides an overall signature but dispatches the actual work to the method corresponding to the class(es) of the arguments used to invoke the generic. A generic may have fewer arguments than a method, as with the S3 function anova and its method anova.glm.
args(anova)
## function (object, ...)
## NULL
args(stats:::anova.glm)
## function (object, ..., dispersion = NULL, test = NULL)
## NULL
The ... argument in the anova generic means that additional arguments are possible; the anova generic hands these arguments to the method it dispatches to. Note that in the previous example, the notation “library:::function” is necessary as the anova.glm function is only present within the stats package namespace (i.e.: the package's own environment) and is not exported to the user environment. More information about libraries is given in a later subsection.
R has a very large number of functions. The following is a brief list of those that might be commonly used and particularly useful.
dir, read.table (and friends), scan - List files in a directory, read spreadsheet-like data into R, efficiently read homogeneous data (e.g.: a file of numeric values) to be represented as a matrix.
c, factor, data.frame, matrix - Create a vector, factor, data frame or matrix.
summary, table, xtabs - Summarize, create a table of the number of times elements occur in a vector, cross-tabulate two or more variables.
t.test, aov, lm, anova, chisq.test - Basic comparison of two groups (t.test), or of several groups via analysis of variance / linear models (aov output is probably more familiar to biologists), or compare simpler with more complicated models (anova); chi-squared tests.
dist, hclust - Cluster data.
plot - Plot data.
ls, str, library, search - List objects in the current (or specified) workspace, or peek at the structure of an object; add a library to, or describe, the search path of attached packages.
lapply, sapply, mapply, aggregate - Apply a function to each element of a list (lapply, sapply), to elements of several lists (mapply), or to elements of a list partitioned by one or more factors (aggregate).
with - Conveniently access columns of a data frame or other element without having to repeat the name of the data frame.
match, %in% - Report the index or existence of elements from one vector that match another.
split, cut - Split one vector by an equal-length factor, cut a single vector into intervals encoded as levels of a factor.
strsplit, grep, sub - Operate on character vectors: splitting them into distinct fields, searching for the occurrence of patterns using regular expressions (see ?regex), or substituting a string for a regular expression.
install.packages - Install a package from an on-line repository into your R library.
traceback, debug, browser - Report the sequence of functions under evaluation at the time of the error; enter a debugger when a particular function or statement is invoked.
?, example - See the help pages (e.g.: ?lm) and examples (example(match)) for each of these functions.
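A few of the functions listed above in action (a quick, self-contained sketch):

```r
## match and %in%
letters4 <- c("a", "b", "c", "d")
match(c("b", "d"), letters4)
## [1] 2 4
c("b", "z") %in% letters4
## [1]  TRUE FALSE

## split a vector by an equal-length factor
split(1:4, factor(c("ctrl", "trt", "ctrl", "trt")))
## $ctrl
## [1] 1 3
##
## $trt
## [1] 2 4

## apply a function to each element of a list
sapply(list(a=1:3, b=4:6), sum)
##  a  b
##  6 15

## character operations with regular expressions
strsplit("chr1:100-200", "[:-]")[[1]]  # split a genomic coordinate string
## [1] "chr1" "100"  "200"
sub("^chr", "", "chr1")                # strip a leading pattern
## [1] "1"
```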
This example uses data describing 128 microarray samples as a basis for exploring R functions. Covariates such as age, sex, type, stage of the disease, etc., are in the data file pData.csv.
The following command creates a variable that is the location of a comma-separated value (‘csv’) file to be used in the exercise. A csv file can be created using, e.g.: ‘Save as…’ in spreadsheet software.
pdataFile <- system.file(package="RnaSeqTutorial", "extdata", "pData.csv")
Input the csv file using read.table, assigning the input to a variable pdata. Use dim to find out the dimensions (number of rows, number of columns) in the object. Are there 128 rows? Use names or colnames to list the columns’ names. Use summary to summarize each column of the data. What are the data types of each column in the data frame?
pdata <- read.table(pdataFile)
dim(pdata)
## [1] 128 21
names(pdata)
## [1] "cod" "diagnosis" "sex" "age"
## [5] "BT" "remission" "CR" "date.cr"
## [9] "t.4.11." "t.9.22." "cyto.normal" "citog"
## [13] "mol.biol" "fusion.protein" "mdr" "kinet"
## [17] "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
summary(pdata)
## cod diagnosis sex age BT
## 10005 : 1 1/15/1997 : 2 F :42 Min. : 5 B2 :36
## 1003 : 1 1/29/1997 : 2 M :83 1st Qu.:19 B3 :23
## 1005 : 1 11/15/1997: 2 NA's: 3 Median :29 B1 :19
## 1007 : 1 2/10/1998 : 2 Mean :32 T2 :15
## 1010 : 1 2/10/2000 : 2 3rd Qu.:46 B4 :12
## 11002 : 1 (Other) :116 Max. :58 T3 :10
## (Other):122 NA's : 2 NA's :5 (Other):13
## remission CR date.cr t.4.11.
## CR :99 CR :96 11/11/1997: 3 Mode :logical
## REF :15 DEATH IN CR : 3 1/21/1998 : 2 FALSE:86
## NA's:14 DEATH IN INDUCTION: 7 10/18/1999: 2 TRUE :7
## REF :15 12/7/1998 : 2 NA's :35
## NA's : 7 1/17/1997 : 1
## (Other) :87
## NA's :31
## t.9.22. cyto.normal citog mol.biol
## Mode :logical Mode :logical normal :24 ALL1/AF4:10
## FALSE:67 FALSE:69 simple alt. :15 BCR/ABL :37
## TRUE :26 TRUE :24 t(9;22) :12 E2A/PBX1: 5
## NA's :35 NA's :35 t(9;22)+other:11 NEG :74
## complex alt. :10 NUP-98 : 1
## (Other) :21 p15/p16 : 1
## NA's :35
## fusion.protein mdr kinet ccr relapse
## p190 :17 NEG :101 dyploid:94 Mode :logical Mode :logical
## p190/p210: 8 POS : 24 hyperd.:27 FALSE:74 FALSE:35
## p210 : 8 NA's: 3 NA's : 7 TRUE :26 TRUE :65
## NA's :95 NA's :28 NA's :28
##
##
##
## transplant f.u date.last.seen
## Mode :logical REL :61 1/7/1998 : 2
## FALSE:91 CCR :23 12/15/1997: 2
## TRUE :9 BMT / DEATH IN CR: 4 12/31/2002: 2
## NA's :28 BMT / CCR : 3 3/29/2001 : 2
## DEATH IN CR : 2 7/11/1997 : 2
## (Other) : 7 (Other) :83
## NA's :28 NA's :35
Let's move on to the next topic: sub-setting. Access the sex column of the data frame using [, [[ or $. Pause to explain to your neighbor why this sub-setting works. Since a data frame is a list, use sapply to ask about the class of each column in the data frame. Explain to your neighbor what this produces, and why. The solution is that a data frame can be subset as if it were a matrix, or a list of column vectors.
head(pdata[,"sex"], 3)
## [1] M M F
## Levels: F M
head(pdata$sex, 3)
## [1] M M F
## Levels: F M
head(pdata[["sex"]], 3)
## [1] M M F
## Levels: F M
sapply(pdata, class)
## cod diagnosis sex age BT
## "factor" "factor" "factor" "integer" "factor"
## remission CR date.cr t.4.11. t.9.22.
## "factor" "factor" "factor" "logical" "logical"
## cyto.normal citog mol.biol fusion.protein mdr
## "logical" "factor" "factor" "factor" "factor"
## kinet ccr relapse transplant f.u
## "factor" "logical" "logical" "logical" "factor"
## date.last.seen
## "factor"
Use table to summarize the number of males and females in the sample. Consult the help page ?table to figure out additional arguments required to include NA values in the tabulation. The number of males and females, including NA, is
table(pdata$sex, useNA="ifany")
##
## F M <NA>
## 42 83 3
An alternative version of this uses the with function: with(pdata, table(sex, useNA="ifany")).
The mol.biol column summarizes molecular biological attributes of each sample. Use table to summarize the different molecular biology levels in the sample. The mol.biol column contains the following samples:
with(pdata, table(mol.biol, useNA="ifany"))
## mol.biol
## ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
## 10 37 5 74 1 1
Use %in% to create a logical vector of the samples that are either BCR/ABL or NEG. A logical vector indicating that the corresponding row is either BCR/ABL or NEG is constructed as
ridx <- pdata$mol.biol %in% c("BCR/ABL", "NEG")
How many samples are either BCR/ABL or NEG? We can get a sense of the number of rows selected via table or sum (discuss with your neighbor what sum does, and why the answer is the same as the number of TRUE values in the result of the table function).
table(ridx)
## ridx
## FALSE TRUE
## 17 111
sum(ridx)
## [1] 111
Subset the data frame; what are the levels of the resulting mol.biol column? Set the levels to be BCR/ABL and NEG, i.e.: the levels in the subset. The original data frame can be subset to contain only BCR/ABL or NEG samples using the logical vector that we created.
pdata1 <- pdata[ridx,]
The levels of each factor reflect the levels in the original object, rather than the levels in the subset object, e.g.:
levels(pdata$mol.biol)
## [1] "ALL1/AF4" "BCR/ABL" "E2A/PBX1" "NEG" "NUP-98" "p15/p16"
These can be re-coded by updating the new data frame to contain a factor with the desired levels.
pdata1$mol.biol <- factor(pdata1$mol.biol)
table(pdata1$mol.biol)
##
## BCR/ABL NEG
## 37 74
Use t.test to assess whether BCR/ABL and NEG have individuals with similar age. To do this, use a formula that describes the response age in terms of the predictor mol.biol. If age is not independent of molecular biology, what complications might this introduce into subsequent analysis? Use a boxplot to represent the age depending on the molecular biology. To ask whether age differs between molecular biology types, we use a formula age ~ mol.biol to describe the relationship ('age as a function of molecular biology') that we wish to test
with(pdata1, t.test(age ~ mol.biol))
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 5, df = 70, p-value = 8e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.1 17.2
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40 28
This summary can be visualized with, e.g.: the boxplot function
boxplot(age ~ mol.biol, pdata1)
For a fancier plot, see the following that uses some additional arguments of the boxplot function.
boxplot(age ~ mol.biol, pdata1,notch=TRUE,ylab="age (yr)",
main="Age distribution by genotype",xlab="genotype")
Packages provide functionality beyond that available in base R. There are almost 5,500 packages in CRAN (the Comprehensive R Archive Network) and more than 800 Bioconductor software packages. Packages are contributed by diverse members of the community; they vary in quality (many are excellent) and sometimes contain idiosyncratic aspects to their implementation. The following table outlines key base packages and selected contributed packages; see a local CRAN mirror (including the task views summarizing packages in different domains) and Bioconductor for additional contributed packages.
| Package | Description |
|---|---|
| base | Data input and essential manipulation; scripting and programming concepts. |
| stats | Essential statistical and plotting functions. |
| lattice, ggplot2 | Approaches to advanced graphics. |
| methods | 'S4' classes and methods. |
| parallel | Facilities for parallel evaluation. |
The lattice package illustrates the value packages add to base R. lattice is distributed with R but not loaded by default. It provides a very expressive way to visualize data. The following example plots yield for a number of barley varieties, conditioned on site and grouped by year. The following figure is read from the lower left corner. Note the common scales, efficient use of space, and not-too-pleasing default color palette. The Morris sample appears to be mis-labeled for ‘year’, an apparent error in the original data. Find out about the built-in data set used in this example with ?barley.
library(lattice)
plt <- dotplot(variety ~ yield | site, data = barley, groups = year,
xlab = "Barley Yield (bushels/acre)" , ylab=NULL,
key = simpleKey(levels(barley$year), space = "top",
columns=2),
aspect=0.5, layout = c(2,3))
print(plt)
New packages can be added to an R installation using install.packages. A package is installed only once per R installation, but needs to be loaded (with library) in each session in which it is used. Loading a package also loads any package that it depends on. Packages loaded in the current session are displayed with search. The ordering of packages returned by search represents the order in which the global environment (where commands entered at the prompt are evaluated) and attached packages are searched for symbols; it is possible for a package earlier in the search path to mask symbols later in the search path; these can be disambiguated using ::.
length(search())
## [1] 9
search()
## [1] ".GlobalEnv" "package:lattice" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "Autoloads" "package:base"
base::log(1:3)
## [1] 0.00 0.69 1.10
Use the library function to load the RnaSeqTutorial package. Use the sessionInfo function to verify that you are using R version 3.2.0 and current packages, similar to those reported here. What other packages were loaded along with RnaSeqTutorial?
library(RnaSeqTutorial)
sessionInfo()
Find help using the R help system. Start a web browser with:
help.start()
The ‘Search Engine and Keywords’ link is helpful in day-to-day use.
Use manual pages to find detailed descriptions of the arguments and return values of functions, and the structure and methods of classes. Find help within an R session as:
?data.frame
?lm
?anova # a generic function
?anova.lm # an S3 method, specialized for 'lm' objects
S3 methods can be queried interactively. For S3,
methods(anova)
## [1] anova.glm* anova.glmlist* anova.lm* anova.lmlist*
## [5] anova.loess* anova.mlm* anova.nls*
## see '?methods' for accessing help and source code
methods(class="glm")
## [1] add1 anova confint cooks.distance
## [5] deviance drop1 effects extractAIC
## [9] family formula influence logLik
## [13] model.frame nobs predict print
## [17] residuals rstandard rstudent summary
## [21] vcov weights
## see '?methods' for accessing help and source code
It is often useful to view a method definition, either by typing the method name at the command line or, for ‘non-visible’ methods, using getAnywhere:
ls
getAnywhere("anova.loess")
Here we discover that the function head (which returns the first 6 elements of anything) defined in the utils package, is an S3 generic (indicated by UseMethod) and has several methods. We use head to look at the first six lines of the head method specialized for matrix objects.
utils::head
## function (x, ...)
## UseMethod("head")
## <bytecode: 0x7f8f4bdfddf0>
## <environment: namespace:utils>
methods(head)
## [1] head.data.frame* head.default* head.ftable* head.function*
## [5] head.matrix head.table*
## see '?methods' for accessing help and source code
head(head.matrix)
##
## 1 function (x, n = 6L, ...)
## 2 {
## 3 stopifnot(length(n) == 1L)
## 4 n <- if (n < 0L)
## 5 max(nrow(x) + n, 0L)
## 6 else min(n, nrow(x))
S4 classes and generics are queried in a similar way to S3 classes and generics, but with different syntax, as for the complement generic in the Biostrings package:
library(Biostrings)
showMethods(complement)
(Most) methods defined on the DNAStringSet class of Biostrings and available on the current search path can be found with
showMethods(class="DNAStringSet", where=search())
Obtaining help on S4 classes and methods requires syntax such as
class ? DNAStringSet
method ? "complement,DNAStringSet"
The specification of method and class in the latter must not contain a space after the comma. The definition of a method can be retrieved as
selectMethod(complement, "DNAStringSet")
Vignettes, especially in Bioconductor packages, provide an extensive narrative describing overall package functionality. Use
vignette(package="RnaSeqTutorial")
to see, in your web browser, vignettes available in the RnaSeqTutorial package. Vignettes usually consist of text with embedded R code, a form of literate programming. The vignette can be read as a PDF document, while the R source code is present as a script file ending with extension .R. The script file can be sourced or copied into an R session to evaluate exactly the commands used in the vignette.
Use the following to open the vignette of interest:
vignette("RnaSeqTutorial")
Scavenger hunt. Spend five minutes tracking down the following information.
The package containing the library function.
The author of the alphabetFrequency function, defined in the Biostrings package.
A description of the GAlignments class.
The number of vignettes in the GenomicRanges package.
Possible solutions are found with the following R commands
?library
library(Biostrings)
?alphabetFrequency
class?GAlignments
vignette(package="GenomicRanges")
There are often many ways to accomplish a result in R, but these different ways often have very different speed or memory requirements. For small data sets these performance differences are not that important, but for large data sets (e.g.: high-throughput sequencing; genome-wide association studies, GWAS) or complicated calculations (e.g.: bootstrapping) performance can be important. There are several approaches to achieving efficient R programming.
Several common performance bottlenecks often have easy solutions; these are outlined here.
Text files often contain more information than is required, for example 1000s of individuals at millions of SNPs when only a subset of the data is needed, e.g.: during algorithm development. Reading in all the data can be demanding in terms of both memory and time. A solution is to use arguments such as colClasses to specify the columns and their data types that are required, and to use nrow to limit the number of rows input. For example, the following ignores the first and fourth columns, reading in only the second and third (as type integer and numeric).
# not evaluated
colClasses <-
c("NULL", "integer", "numeric", "NULL")
df <- read.table("myfile", colClasses=colClasses)
R is vectorized, so the explicit for loops of traditional programming are often not necessary. Rather than calculating 100000 random numbers one at a time, or squaring each element of a vector, or iterating over rows and columns in a matrix to calculate row sums, invoke the single function that performs each of these operations.
x <- runif(100000); x2 <- x^2
m <- matrix(x2, nrow=1000); y <- rowSums(m)
This often requires a change of thinking, turning the sequence of operations ‘inside-out’. For instance, calculate the log of the square of each element of a vector by calculating the square of all elements, followed by the log of all elements: x2 <- x^2; logx2 <- log(x2), or simply logx2 <- log(x^2).
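To make the 'inside-out' idea concrete, here is a minimal sketch comparing an element-by-element loop with the vectorized equivalent; both compute the same result:

```r
x <- runif(1000)

## loop version: operate on one element at a time
res_loop <- numeric(length(x))
for (i in seq_along(x))
    res_loop[i] <- log(x[i]^2)

## vectorized version: square all elements, then take the log of all
res_vec <- log(x^2)

stopifnot(all.equal(res_loop, res_vec))
```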
It may sometimes be natural to formulate a problem as a for loop, or the formulation of the problem may require that a for loop be used. In these circumstances the appropriate strategy is to pre-allocate the object, and to fill the result in during loop iteration.
## not evaluated
result <- numeric(nrow(df))
for (i in seq_len(nrow(df)))
result[[i]] <- some_calc(df[i,])
Some R operations are helpful in general, but misleading or inefficient in particular circumstances. An example is the behavior of unlist when the list is named – R creates new names that have been made unique. This can be confusing (e.g.: when Entrez gene identifiers are ‘mangled’ to unintentionally look like other identifiers) and expensive (when a large number of new names need to be created). Avoid creating unnecessary names, e.g.:
unlist(list(a=1:2)) # name 'a' becomes 'a1', 'a2'
## a1 a2
## 1 2
unlist(list(a=1:2), use.names=FALSE) # no names
## [1] 1 2
Names can be very useful for avoiding book-keeping errors, but are inefficient for repeated look-ups; use vectorized access or numeric indexing.
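A small sketch of this point, using a made-up named vector: resolving all names to positions once with match, then indexing numerically, avoids a per-element name look-up and gives the same result:

```r
## A named vector with 10000 hypothetical identifiers
x <- setNames(runif(10000), paste0("id", seq_len(10000)))
ids <- sample(names(x), 1000)

## slow: one name look-up per element, repeated in a loop
res_name <- vapply(ids, function(i) x[[i]], numeric(1))

## fast: resolve names to positions once, then use numeric indexing
idx <- match(ids, names(x))
res_idx <- x[idx]

stopifnot(all.equal(unname(res_name), unname(res_idx)))
```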
Several solutions to inefficient code require greater knowledge to implement.
Using appropriate functions can greatly influence performance; it takes experience to know when an appropriate function exists. For instance, the lm function could be used to assess differential expression of each gene on a microarray, but the limma package implements this operation in a way that takes advantage of the experimental design that is common to each probe on the microarray, and does so in a very efficient manner.
## not evaluated
library(limma) # microarray linear models
fit <- lmFit(eSet, design)
Using appropriate algorithms can have significant performance benefits, especially as data becomes larger. This solution requires moderate skills, because one has to be able to think about the complexity (e.g.: expected number of operations) of an algorithm, and to identify algorithms that accomplish the same goal in fewer steps. For example, a naive way of identifying which of 100 numbers are in a set of size 10 might look at all 100 combinations of numbers (i.e.: polynomial time), but a faster way is to create a ‘hash’ table of one of the set of elements and probe that for each of the other elements (i.e.: linear time). The latter strategy is illustrated with
x <- 1:100; s <- sample(x, 10)
inS <- x %in% s
R is an interpreted language, and for very challenging computational problems it may be appropriate to write critical stages of an analysis in a compiled language like C or Fortran, or to use an existing programming library (e.g.: the BOOST graph library) that efficiently implements advanced algorithms. R has a well-developed interface to C or Fortran, so it is ‘easy’ to do this. This places a significant burden on the person implementing the solution, requiring knowledge of two or more computer languages and of the interface between them.
When trying to improve performance, one wants to ensure (a) that the new code is actually faster than the previous code, and (b) both solutions arrive at the same, correct, answer.
The system.time function is a straight-forward way to measure the length of time a portion of code takes to evaluate. Here we see that the use of apply to calculate row sums of a matrix is much less efficient than the specialized rowSums function.
m <- matrix(runif(200000), 20000)
replicate(5, system.time(apply(m, 1, sum))[[1]])
## [1] 0.027 0.024 0.023 0.020 0.027
replicate(5, system.time(rowSums(m))[[1]])
## [1] 0.001 0.001 0.001 0.000 0.001
Usually it is appropriate to replicate timings to average over vagaries of system use, and to shuffle the order in which timings of alternative algorithms are calculated to avoid artifacts such as initial memory allocation.
Speed is an important metric, but equivalent results are also needed. The functions identical and all.equal provide different levels of assessing equivalence, with all.equal providing ability to ignore some differences, e.g.: in the names of vector elements.
res1 <- apply(m, 1, sum)
res2 <- rowSums(m)
identical(res1, res2)
## [1] TRUE
identical(c(1, -1), c(x=1, y=-1))
## [1] FALSE
all.equal(c(1, -1), c(x=1, y=-1),
check.attributes=FALSE)
## [1] TRUE
Two additional functions for assessing performance are Rprof and tracemem; these are mentioned only briefly here. The Rprof function profiles R code, presenting a summary of the time spent in each part of several lines of R code. It is useful for gaining insight into the location of performance bottlenecks when these are not readily apparent from direct inspection. Memory management, especially copying large objects, can frequently contribute to poor performance. The tracemem function allows one to gain insight into how R manages memory; insights from this kind of analysis can sometimes be useful in restructuring code into a more efficient sequence.
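As a brief sketch of Rprof in use (the matrix here is arbitrary, and apply is chosen deliberately as a known-slow construct):

```r
## Profile a deliberately inefficient computation and summarize
## where the time was spent
profFile <- tempfile()
Rprof(profFile)
m <- matrix(runif(2e6), nrow=20000)
res <- replicate(10, apply(m, 1, sum))   # rowSums(m) would be much faster
Rprof(NULL)
## Time spent in each function, ordered by self time
head(summaryRprof(profFile)$by.self)
```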
R signals unexpected results through warnings and errors. Warnings occur when the calculation produces an unusual result that nonetheless does not preclude further evaluation. For instance log(-1) results in the value NaN (‘not a number’) that allows computation to continue, but at the same time signals a warning
> log(-1)
[1] NaN
Warning message:
In log(-1) : NaNs produced
Errors result when the inputs or outputs of a function are such that no further action can be taken, e.g.: trying to take the square root of a character vector
> sqrt("two")
Error in sqrt("two") : Non-numeric argument to mathematical function
Warnings and errors occurring at the command prompt are usually easy to diagnose. They can be more enigmatic when occurring in a function, and exacerbated by sometimes cryptic (when read out of context) error messages.
An initial step in coming to terms with errors is to simplify the problem as much as possible, aiming for a ‘reproducible’ error. The reproducible error might involve a very small (even trivial) data set that immediately provokes the error. Often the process of creating a reproducible example helps to clarify what the error is, and what possible solutions might be.
Invoking traceback() immediately after an error occurs provides a ‘stack’ of the function calls that were in effect when the error occurred. This can help understand the context in which the error occurred. Knowing the context, one might use debug to enter into a browser (see ?browser) that allows one to step through the function in which the error occurred.
It can sometimes be useful to use global options (see ?options) to influence what happens when an error occurs. Two common global options are error and warn. Setting error=recover combines the functionality of traceback and debug, allowing the user to enter the browser at any level of the call stack in effect at the time the error occurred. Default error behavior can be restored with options(error=NULL). Setting warn=2 causes warnings to be promoted to errors. For instance, initial investigation of an error might show that the error occurs when one of the arguments to a function has value NaN. The error might be accompanied by a warning message that the NaN has been introduced, but because warnings are by default not reported immediately it is not clear where the NaN comes from. warn=2 means that the warning is treated as an error, and hence can be debugged using traceback, debug, and so on.
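A minimal sketch of promoting warnings to errors with warn=2; the NaN warning from log(-1) then surfaces as a catchable error whose origin can be traced:

```r
## Promote warnings to errors, then restore the default behaviour
op <- options(warn=2)
res <- tryCatch(log(-1), error=function(e) conditionMessage(e))
options(op)
res   # the captured message notes that NaNs were produced
```

In an interactive session one would typically omit the tryCatch and instead use traceback or debug once the promoted error stops evaluation.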
Additional useful debugging functions include browser, trace, and setBreakpoint.
As already mentioned, help can be sought from mailing lists. For questions with regards to R, address them to the:
R mailing lists: http://www.r-project.org
For Bioconductor questions, address them to the:
Bioconductor mailing lists: http://bioconductor.org/help/mailing-list/
In either case, include a minimal reproducible example and the output of sessionInfo. Providing this information will speed up the help and the issue resolution if there is one.
Dalgaard (Dalgaard 2008) provides an introduction to statistical analysis with R.
Kabacoff (Kabacoff 2010) provides a broad survey of R.
Matloff (Matloff 2011) introduces R programming concepts.
Chambers (Chambers 2008) provides more advanced insights into R.
Gentleman (Gentleman 2008) emphasizes use of R for bioinformatic programming tasks.
The R web site enumerates additional publications from the user community.
The RStudio environment provides a nice, cross-platform environment for working in R.
Bioconductor http://bioconductor.org
Quick-R http://www.statmethods.net
R web site http://r-project.org
Note that in the following we will refer to any of the 17 samples as “read”, so if you selected sample 01, “read” means “202_subset” for you.
The first pre-processing step is to assess the quality of the raw data received from the sequencing facility. This data is commonly delivered in FASTQ format (Cock et al. 2009).
Upon receiving the RNA-Seq FASTQ files from the sequencing facility, it is essential that some initial QC assessments be performed. Most importantly, one should check the overall sequence quality, the GC percentage distribution (the proportion of guanine and cytosine bp across the reads) and the presence / absence of overrepresented sequences. FastQC has become a de-facto standard for performing this task: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. A FastQC command line example is as follows:
# First, make an output directory called "qa" and a sub-directory of
# that called "raw"
mkdir -p qa/raw
# Then make copies of your sample files into your home folder. Always try to
# work on copies of the raw source data.
cp share/Day1/data/fastq/read_1.fq.gz .
cp share/Day1/data/fastq/read_2.fq.gz .
# Then you can run fastqc on the file(s) you have selected
fastqc -o qa/raw read_1.fq.gz
The output of FastQC is a zip archive and an HTML document, which is sub-divided into sections describing the specific metrics that were analyzed. These sections are detailed below. In order to view the html page that was created, go to the course webpage and click on “Connect to Apache2” , then click on home > qa > raw and finally on the html file.
Most metrics within this section are self-explanatory. For PE reads, the total number of sequences should match between the forward and reverse read files. It is good practice to take note of the FASTQ Phred encoding, as some downstream tools require the user to specify whether Phred64 or Phred33 encoding should be used. Finally, the %GC should lie within the expected values for the sample species.
The Phred scale quality represents the probability p that the corresponding base call is incorrect. A Phred score Q is an integer mapping of p where: \[ Q = -10 \cdot \log_{10}(p) \] Briefly, a Phred score of 10 corresponds to one error in every 10 base calls or 90% accuracy; a Phred score of 20 to one error in every 100 base calls or 99% accuracy. The maximum Phred score is 40 (41 for Illumina version 1.8+ encoding). See http://en.wikipedia.org/wiki/FASTQ_format#Quality for more details on the quality and http://en.wikipedia.org/wiki/FASTQ_format#Encoding for information on the corresponding encoding.
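The mapping between Q and p can be sketched directly in R (the function names here are our own, for illustration):

```r
## Phred score from error probability, and the inverse mapping
phredQ <- function(p) -10 * log10(p)
phredP <- function(Q) 10^(-Q / 10)

phredQ(0.1)    # 10: one error in every 10 base calls, 90% accuracy
phredQ(0.01)   # 20: one error in every 100 base calls, 99% accuracy
phredP(40)     # 1e-04, the error probability at the maximum Phred score
```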
The second FastQC section details the Phred scaled quality as a function of the position in the read. It is very common to observe a quality decrease as a function of the read length (Figure 2) and this pattern is often more pronounced for read2 than it is for read1; this is due to cumulative stochastic errors as the sequencing progresses, largely as a result of the enzyme “tiring out” and the increasing likelihood that a read cluster becomes out of sync, for example.
Example of Quality Control reports at different stages of the pipeline. A. The “Per sequence GC content” of the raw data. B. The same data shown in A but after rRNA filtering. C. “Per base sequence content” of the raw data. D. The same data after quality-based trimming has been performed.
This section details the quality distribution at the read level, in contrast to the quality per base position of the previous section. If the data is of good quality, the histogram will be skewed to the right.
In this section, the average proportion of individual bases (A, C, G and T) is plotted as a line across the length of the reads. The first 12 bases often show a bias that is characteristic of Illumina RNA-Seq data. This is in contrast with the DNA-Seq protocol, which does not show the same bias. The difference between protocols lies in three additional steps performed during the conversion of mRNA to cDNA, which is subsequently sequenced as if it were genomic DNA. Several hypotheses have been proposed as to the cause of this bias: the random hexamer primers used during reverse transcription of the captured cDNA may introduce a positional bias of the reads; artifacts from end repair; and possibly a tenuous sequence specificity of the polymerase may each play a role, either singly or, most likely, in combination.
Similar to the previous section, the GC content is shown as a function of the position in the read. As previously observed, a bias for the first base pairs (once more in contrast to DNA sequencing data) will often be observed. In addition, for non-strand specific RNA-Seq data, the amount of G and C and of A and T should be similar, on average, at any position within reads. Moreover the proportion of G+C should match the expected GC content of the sample. For strand-specific data, if the RNA was selected using poly-dT beads, enrichment for T over A may be observed.
The plot in this section (see Figure 2 for an example) represents the distribution of GC content per read, where the data (red curve) is expected to approximately follow the theoretical distribution (blue curve). If the curve presents a shoulder in a region of high GC content, this is usually an indication that rRNA is present in the sample. However, it may also represent contamination by an organism with a higher GC content (such as bacteria or fungi). In contrast, a peak on the left hand side would indicate enrichment for A/T rich sequences. In particular a sharp peak for very low GC content (in the 0-3 range) is usually indicative of the sequencing of the mRNA poly-A tails. If this plot still shows issues after quality and rRNA filtering, additional steps would have to be taken to filter contaminants.
A comparison of the “theoretical” and “observed” GC distributions, the blue and red lines of the FastQC “Per sequence GC content” plot. A. Examples of “observed” GC distributions with a poly-A enrichment (green), rRNA enrichment (red) or no bias (black). B. The corresponding “theoretical” curves that FastQC would devise from such read GC content distributions.
This plot shows the fraction of indistinguishable bases as a function of the base position in the reads. In high quality sequence data this is expected to be close to zero. Deviations from the expected values indicate problems during the sequencing.
This section shows the distribution of read lengths. Prior to trimming, there should be exactly one peak located at the sequenced read length.
This plot represents the level of duplicate sequences in the library. FastQC assumes that the library is diverse, with even representation of all sequences, i.e.: it assumes the uniform coverage that would usually be obtained for DNA-Seq experiments. However, this assumption is not valid for RNA-Seq libraries, which have a large dynamic range, possibly containing a million-fold difference between lowly and highly expressed genes. As a result it is common to observe high duplication levels for sequences originating from highly expressed genes. It is worth noting that before version 0.11 of FastQC, all duplication levels >= 10 were aggregated into a single bin. In more recent versions this has been made more comprehensive in order to provide a more accurate representation of the data.
This table shows sequences that are present at unusually large frequency in the reads. These are most commonly sequencing adapters and will be identified as such. If unidentified sequences are detailed these may originate from rRNA or other contaminants, in which case contaminant filtering should be considered. Often a BLAST search of the unidentified sequence using the NCBI nt database will be informative.
The final plot of the FastQC report details the occurrence of kmers - nucleotide sequences of fixed k length - that were present at a higher than expected frequency as a function of their position within the read. Commonly, the early bases show the aforementioned Illumina sequencing bias (see section d), whereas the last bases may show enrichment for sequencing adapters.
Run FastQC for your assigned read file. Compare the results with those of your teammate. Does the QC look fine to you, or are there issues you would address?
Take a comparative look at samples 202, 207 and 213. Do these 3 samples have identical qualities? If not, what differentiates them? Which of these 3 does the sample you have analysed most resemble?
There are 4 samples that have differing %GC characteristics as well as possibly some over-represented sequences: samples 202, 213.1, 303 and 310.3. All samples otherwise show some adapter contamination and the usual average quality drop as sequencing cycles progress. Both these issues can be remedied, and this is what will be addressed in the following.
Typically, wet-lab protocols to extract mRNA include a step to deplete the sample of rRNA or to enrich it for poly-adenylated transcripts (rRNA is not poly-adenylated). Common approaches to achieve this are to use RiboMinus™ kits (Life Technologies) or poly-dT beads, respectively, or to include a precipitation step that selectively precipitates only long (usually >200 bp) nucleotide fragments. No approach is fully sensitive and, as a result, some rRNA carryover is to be expected. This is not a problem per se as long as the remaining proportion accounts for a low percentage of the reads (commonly between 0.1 and 3%). Larger proportions will reduce the usable number of reads obtained from the samples, as fewer sequence reads would originate from expressed mRNAs. This is not to be overlooked, as these rRNAs will produce valid alignments (in all reference genomes and for most de novo assembled transcriptomes and genomes) and hence other metrics (such as the alignment rate) will fail to identify such a bias. To control for the rRNA quantity in our sample FASTQ files, we use SortMeRNA, a tool originally developed to identify rRNA in metagenomics analyses (Kopylova, Noé, and Touzet 2012). The tool accepts FASTQ files (SE or PE) as input and includes a set of reference libraries containing the most common rRNAs (5S, 5.8S, 16S, 18S, 23S and 26-28S). Example command lines for a PE sample are:
# First uncompress your forward and reverse reads:
gunzip read_1.fq.gz read_2.fq.gz
# Merge forward and reverse reads into a single file (this is a requirement
# of the sortmerna software)
merge-paired-reads.sh read_1.fq read_2.fq read-interleaved.fq
# Run sortmerna
sortmerna --ref $SORTMERNA_DB --reads read-interleaved.fq --aligned \
read-rRNA-hits --other read-sortmerna --log -a 4 -v --paired_in --fastx
# Split the file again in two separate files, one for forward, one for reverse
# orientation
unmerge-paired-reads.sh read-sortmerna.fq read-sortmerna_1.fq read-sortmerna_2.fq
# Delete the interleaved files and compress the result files to save space
rm read-interleaved.fq read-sortmerna.fq
gzip read-sortmerna_1.fq read-sortmerna_2.fq
The format conversion step is not required for SE samples, nor is the “--paired_in” argument. The SORTMERNA_DB environment variable needs to be set at installation and the “-a” argument sets the number of CPUs/threads to be used. The tool’s manual provides a comprehensive description of all functionalities. SortMeRNA does not currently support compressed input, hence the first and last steps to (de)compress the data.
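Since the interleaving script assumes the two mate files are synchronized, a quick sanity check before merging can save debugging later. The helper below is a hypothetical sketch using only core utilities (it is not part of SortMeRNA):

```shell
paired_ok() {
    # A FASTQ record is exactly 4 lines; both mates must hold the
    # same number of records for interleaving to make sense
    n1=$(zcat -f "$1" | wc -l | tr -d ' ')
    n2=$(zcat -f "$2" | wc -l | tr -d ' ')
    [ "$n1" -eq "$n2" ] && [ $(( n1 % 4 )) -eq 0 ]
}
# usage: paired_ok read_1.fq.gz read_2.fq.gz && echo "mates in sync"
```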
Run SortMeRNA on your sample (both read_1 and read_2). This may take a while (about 10 mins). Once done, look at the produced log file.
Next, we perform the QC on the filtered data.
The filtered data is again subjected to a QC assessment by FastQC to ensure the validity of the filtering steps.
Redo the FastQC on the SortMeRna files you have generated.
The GC content plot should show the biggest change, now fitting more closely to the theoretical distribution, as shown in Figure 2A and Figure 2B, which represent the raw and filtered GC content respectively. Shoulders, which were present in regions of higher GC content, should be noticeably smaller or absent altogether. rRNA sequences should no longer be identified in the corresponding table of over-represented sequences. Finally, the GC distribution should now be centered closer to the expected GC value of the sample organism.
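To see where such a summary comes from, the per-file mean GC content that FastQC reports can be approximated with core utilities alone; a minimal sketch (the helper name is ours, not FastQC's):

```shell
gc_percent() {
    # sequence lines are every 2nd line of each 4-line FASTQ record
    zcat -f "$1" | awk 'NR % 4 == 2 {
        tot += length($0)
        gc  += gsub(/[GCgc]/, "", $0)   # gsub returns the number of G/C bases
    } END { printf "%.1f\n", 100 * gc / tot }'
}
# usage: gc_percent read-sortmerna_1.fq.gz
```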
On Illumina sequencers, the quality of a base call is linked to its position in the read: bases in the later cycles of the sequencing process have a lower average quality than those in the earliest cycles (as was observed in the QC report above). This effect depends on the sequencing facility and on the chemistry used, and it is only recently that sequence aligners have integrated methods to correct for it - and not all alignment software does so. A common approach to increase the mapping rate of reads is to trim (remove) low quality bases from the 3’ end until the quality reaches a user-selected Phred-quality threshold. A threshold of 20 is widely accepted, as it corresponds to a base call error rate of 1 in 100, which is approximately the inherent technical error rate of the Illumina sequencing platform.
An additional issue encountered with Illumina sequencing is the presence of partial adapter sequences within sequenced reads. This arises when the sample fragment size has a large variance and fragments shorter than the sequencer read length are sequenced. As the resulting reads may contain a significant part of the adapter - a sequence of artificial origin - earlier generation alignment software (those that do not use Maximum Exact Matching and require global alignment of the entire read) may not be able to map such reads. Being able to identify adapter-like sequences at the end of a read and clip/trim them - a process termed adapter removal - may consequently increase the aligned read proportion significantly.
There are numerous tools available to perform either or both of these tasks (quality trimming and adapter removal). Here, we exemplify using Trimmomatic (Bolger, Lohse, and Usadel 2014), a tool that does both. The command line for our PE sample example is as follows:
trimmomatic PE -threads 4 -phred64 read-sortmerna_1.fq.gz \
read-sortmerna_2.fq.gz read-sortmerna-trimmomatic_1.fq.gz \
read-sortmerna-unpaired_1.fq.gz read-sortmerna-trimmomatic_2.fq.gz \
read-sortmerna-unpaired_2.fq.gz \
ILLUMINACLIP:"/usr/share/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa":2:30:10 \
SLIDINGWINDOW:5:20 MINLEN:50
Run Trimmomatic on your filtered FASTQ data.
Once again, what would you do after having quality-trimmed your reads and removed the sequencing adapters?
Yes! Quality Control! As in the next section.
A final FastQC run is performed to ensure that the previous quality trimming and/or adapter removal steps successfully conserved high quality reads without being too stringent and without introducing any newly apparent technical biases.
Redo the FastQC on the Trimmomatic files you have generated.
Several changes should be observed in comparison with the previous QC report: most noticeably, the per-base quality scores should differ. As shown in Figure 2C,D, the per-sequence quality distribution is now shifted towards higher scores (the trimming effect) and sequencing adapters are no longer identified as over-represented (the adapter removal effect). If over-represented sequences remain, this indicates that an additional kind of contamination may be present and should be investigated.
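A complementary check to the QC report is how many reads survived each step; since a FASTQ record is exactly 4 lines, counting is trivial (the file names in the comment follow this tutorial's naming):

```shell
count_reads() {
    zcat -f "$1" | wc -l | awk '{ print $1 / 4 }'
}
# e.g. compare before and after trimming:
# count_reads read-sortmerna_1.fq.gz
# count_reads read-sortmerna-trimmomatic_1.fq.gz
```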
Once the raw read quality has been assessed and determined to be sufficient, or the data has been filtered and trimmed to acceptable standards, the reads can be aligned to a reference. This process is an extremely active field of research and novel aligners are frequently published. There is, sadly, no “silver bullet”: the choice of aligner will depend on the reference to be used (genome or transcriptome), the data type (short vs. longer reads) and the available computational power, among other factors. Most recent aligners use either the BWT (Burrows-Wheeler transform, Burrows M 1994) or MEM (Maximum Exact Matches, Khan et al. 2009) to perform alignment. Older generation alignment algorithms relied on a spaced-seed approach (Heng Li and Homer 2010). The numerous implementations of these different strategies all come with a myriad of options that may significantly affect the alignment outcome. Selecting the most accurate aligner and determining the optimal parameter set for a project can often represent a small project in itself. At the time of writing this guide, there was no guideline available as to which aligner is most appropriate for a given situation (read length, type of reference, etc.). Hence, in the following, we exemplify using aligners that we have incorporated in our processing pipeline based on internal benchmarking for our most common experimental setup: tree genome / transcriptome, Illumina HiSeq 2500, 101bp PE sequencing. The aligner of choice varies based on the type of reference available for the project: for genome-based alignment of RNA-Seq data we use STAR (Dobin et al. 2013), a MEM-based aligner - it actually uses MMP (maximum mappable prefix), a variation of MEM; for alignment of RNA-Seq data to a reference transcriptome we use either bowtie (version 1, BWT FM-index based, Langmead et al. 2009) or the BWT FM-index or MEM implementations of BWA (H. Li and Durbin 2010; H. Li and Durbin 2009).
NOTE: The command line below is shown for exemplary purposes, NOT to be executed :-)
First, the genome needs to be indexed. This is performed using the following command:
STAR --runMode genomeGenerate --genomeDir GENOME/Indices --genomeFastaFiles \
GENOME/FASTA/genome.fa --runThreadN 8 --sjdbOverhang 100 \
--sjdbGTFfile GENOME/GTF/genome.gtf
where the genomeDir parameter specifies the output directory for the index, genomeFastaFiles specifies the genome FASTA file path and sjdbGTFfile the file path of the gene annotation, which can typically be retrieved from EnsEMBL (in GTF format) or UCSC (in GFF3 format, to be converted to GTF format). We also provide an additional option that needs to be edited depending on your sequencing read length (sjdbOverhang 100); we selected 100 as our longest reads are 101bp long - see the STAR manual for the rationale behind this.
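Since sjdbOverhang should be (longest read length - 1), it can help to verify the read length directly from the data; a small sketch with core utilities (the helper name is ours):

```shell
longest_read() {
    # read sequences are every 2nd line of each 4-line FASTQ record
    zcat -f "$1" | awk 'NR % 4 == 2 { if (length($0) > m) m = length($0) }
                        END { print m }'
}
# --sjdbOverhang would then be $(longest_read read_1.fq.gz) minus 1
```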
Once the genome index is built, we can align our sample reads to it. This is achieved as follows:
mkdir star
STAR --genomeDir share/Day1/data/indices/STAR --readFilesIn read-sortmerna-trimmomatic_1.fq.gz \
read-sortmerna-trimmomatic_2.fq.gz --runThreadN 4 --alignIntronMax 11000 --outSAMstrandField intronMotif \
--sjdbGTFfile share/Day1/data/reference/gff/Ptrichocarpa_v3.0_210_synthetic-gene-models-wo-introns.gff3 \
--readFilesCommand zcat --outFileNamePrefix star/read-sortmerna-trimmomatic-STAR --outSAMmapqUnique 254 \
--outQSconversionAdd -31 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM --outFilterMultimapNmax 100 --chimSegmentMin 1 --outWigType bedGraph \
--sjdbGTFtagExonParentTranscript Parent
where there are a number of additional parameters: --alignIntronMax is important to specify so that STAR does not try to align split reads across a distance greater than --alignIntronMax bp; reads that span an exon-exon junction (EEJ) only need to span at most the longest intron in your genome. Note that the largest intron in P. trichocarpa is about 11kb long. The parameter --outFileNamePrefix sets the path and prefix to where the results will be written (note that from now on, as the reads have been combined into a single result file, we refer to our exemplary data as “sample”). We provide a few additional parameters that may require adjustment based on your data:
Our sample files are GZip compressed so we inform STAR how to read it (readFilesCommand zcat). As our files were generated using the Illumina v1.5 FASTQ format, we convert them into Sanger FASTQ (outQSconversionAdd -31) and finally we specify that STAR should output unmapped reads separately (outReadsUnmapped Fastx). The outSAMtype BAM SortedByCoordinate ensures that the final result is already converted to the “BAM” format and sorted by coordinate, saving a few additional steps doing this via samtools. As you can see, we provide many more arguments, which you can lookup in the STAR manual; STAR is highly configurable!
STAR returns a number of result files:
a read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam file that contains the alignments in BAM format (Li et al. 2009).
two FASTQ files containing the forward and reverse unmapped reads
a number of sample-sortmerna-trimmomatic-STARLog.* log files
a number of sample-sortmerna-trimmomatic-SJ.* files containing splice junction information.
The BAM format is very efficient for computers to work with, but as a binary file format, it is unreadable to humans. If you would like to take a look at the alignments, you can do so using samtools:
samtools view star/read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam | head
The sorted BAM file is then indexed:
samtools index star/read-sortmerna-trimmomatic-STARAligned.sortedByCoord.out.bam
Furthermore, the FASTQ files containing the unaligned reads are renamed and compressed.
The log files, which contain information relating to the processing and splice-junctions, are moved into a log directory.
mkdir star/sample-sortmerna-trimmomatic-STAR_logs
mv star/sample-sortmerna-trimmomatic-STARLog.* \
   star/sample-sortmerna-trimmomatic-SJ.* \
   star/sample-sortmerna-trimmomatic-STAR_logs
Among the log files, Log.final.out and SJ.out.tab are of particular interest. The first details the alignment rate, percentage of uniquely/multiply aligning reads, rate of mismatches, INDELs identified in the reads, etc. The second file describes, in a tabular format, all the EEJs identified by STAR and whether these exist in the provided gff3 file or are novel. This is an extremely useful resource that can be used to identify possible new transcript splice variants. One needs to keep in mind that transcription, like all biological processes, is stochastic and, as such, mis-spliced transcripts will be present at a low frequency in any RNA-Seq sample that has been sequenced to adequate depth. Hence, novel junctions might represent genuine low-frequency transcription as well as noise.
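To avoid opening Log.final.out by hand for every sample, the headline numbers can be grepped out; a small sketch (the label strings are those used by current STAR versions and may vary slightly between releases):

```shell
star_stats() {
    # pull the unique- and multi-mapping rates out of a STAR Log.final.out file
    grep -E "Uniquely mapped reads %|to multiple loci" "$1"
}
# star_stats star/sample-sortmerna-trimmomatic-STAR_logs/sample-sortmerna-trimmomatic-STARLog.final.out
```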
Kallisto is a fairly recent program that makes heavy use of k-mer hash tables and De Bruijn graphs to “pseudo-align” reads (Bray et al. 2016). This means that Kallisto does not map a read to a given position in a genome/transcriptome, but rather to an abstract representation of a transcript. Kallisto then uses likelihood estimation and an expectation-maximization algorithm to calculate a count and - on request - bootstraps a confidence value.
Step 1 is to create the Kallisto index from a transcriptome:
kallisto index -i Ptrichocarpa_v3.0_210_synthetic-gene-models.idx \
share/Day1/data/reference/fasta/Ptrichocarpa_v3.0_210_synthetic-gene-models.fa
We could modify the k-mer size using the -k option. This is, however, often a very sensitive parameter in k-mer/hash based applications: lower values will provide higher sensitivity but lower specificity, and vice versa for higher values of k. The optimal k would be the longest sub-sequence length at which reads contain not a single mismatch due to natural polymorphisms or technical artefacts. In practice, this is very difficult to estimate unless you have access to a large pool of population genomics data.
Next we can quantify the input data using this newly created index:
kallisto quant -i Ptrichocarpa_v3.0_210_synthetic-gene-models.idx -b 100 \
-o . --pseudobam read-sortmerna-trimmomatic_1.fq.gz read-sortmerna-trimmomatic_2.fq.gz \
> sample-sortmerna-trimmomatic_pseudo.sam
The option -b 100 performs a bootstrap confidence estimation using 100 iterations. --pseudobam tells Kallisto to output a SAM file of pseudoalignments to stdout.
Kallisto outputs several files: the abundance estimates (abundance.tsv, and abundance.h5 which also holds the bootstrap values) and a run_info.json run log.
The SAM file can be converted to a sorted BAM and indexed:
samtools view -bT Ptrichocarpa_v3.0_210_synthetic-gene-models.fa \
sample-sortmerna-trimmomatic_pseudo.sam | \
samtools sort -o sample-sortmerna-trimmomatic_pseudo.bam -T PREF && \
samtools index sample-sortmerna-trimmomatic_pseudo.bam
Visualising alignments can prove useful to assess the data quality or the aligner’s performance. Let us take a look at the STAR results. We have already copied them to a place that JBrowse (the online genome browser we will be using, which you can access from the course webpage by clicking on your “Connect to Apache2” link) can access.
The following six chapters aim at taking a look “under the hood” of feature summarisation, i.e. how a count table is created by overlapping the read alignments with the genic annotation. At the same time, we will toy with computer optimisation (parallel processing and memory management). This will introduce you to efficient and practical R usage.
Before experimenting with the Bioconductor package functionalities that were presented in the lecture, we will first consolidate the knowledge you have gathered so far by addressing the computational challenges faced when using HTS data: memory and time consumption.
In the lecture, the readGAlignments function from the GenomicAlignments package was introduced and used to extract a GAlignments object. However, most of the time, an experiment will NOT consist of a single sample (of only about a million reads!) and an obvious way to speed up the process is to parallelize. In the following three sections, we will see how to do this before ultimately discussing the pros and cons of the implemented method.
First of all, locate the BAM files and implement a function to read them sequentially. Have a look at the lapply function man page for doing so.
library(GenomicAlignments)
bamfiles <- dir(file.path(extdata(),"BAM"),
pattern="*.bam$",full.names=TRUE)
gAlns <- lapply(bamfiles,readGAlignments)
Nothing complicated so far, right? We process both files sequentially and get a list of GAlignments objects stored in the gAlns object. Apart from the coding enhancement - with one line, we can process all our samples - there are no other gains.
Modern laptops commonly possess a CPU with 2 or more cores that can perform tasks independently. Computational servers usually have several CPUs (commonly 8), each having several cores. An obvious enhancement of our previous solution is to take advantage of this CPU architecture and process our samples in parallel.
Have a look at the parallel package and in particular at the mclapply function to re-implement the previous function in a parallel manner.
library(parallel)
gAlns <- mclapply(bamfiles,readGAlignments)
It is NOT because there were 2 files to process. mclapply has a number of default parameters - see ?mclapply for details - including mc.cores, which defaults to 2. If you want to process more samples in parallel, set that parameter value accordingly.
This new implementation has the obvious advantage of being X times faster (with X being the number of CPUs used, or almost so, as parallelization comes with a slight overhead cost), but it puts a different strain on the system. As several files are being processed in parallel, the memory requirement also increases by a factor X (assuming files of roughly equivalent size are processed). This might be fine on a computational server, but given the constant increase in sequencing reads produced per run, it will eventually become a limitation.
Can you think of a way this memory issue could be addressed? I.e. what could we modify in the way we read/process the files to limit the memory required at any given moment?
No, buying more memory is usually not an option. And anyway, at the moment, the rate of increase in reads sequenced per run is faster than the memory doubling time. So, let us just move to the next section to have a go at addressing the issue.
To limit the memory required at any moment, one approach would be to process the file not as a whole, but chunk-wise. As we can assume that reads are stored independently in BAM files (or almost so; think of how Paired-End data is stored!), we can simply decide to parse, e.g., 1,000,000 reads at a time. This will of course require a new way to represent a BAM file in R, i.e. not just as a character string, as we have had it until now in our bamfiles object.
The Rsamtools package again comes in handy. Look up the ?BamFile help page and try to work out how we could take advantage of the BamFile or BamFileList classes for our purpose. As the files are small, use a yieldSize of 100,000 reads.
The yieldSize parameter of either class is exactly what we want. Let us recode our character object into a BamFileList.
library(Rsamtools)
bamFileList <- BamFileList(bamfiles,yieldSize=10^5)
Now that we have the BAM files described in a way that lets us process them chunk-wise, let us do so. The paradigm is as follows (note: it is just that, a paradigm, so DO NOT RUN IT, but rather take inspiration from it):
## DO NOT RUN ME!
## I AM JUST AN EXAMPLE
open(bamFile)
while(length(chunk <- readGAlignments(bamFile))){
message(length(chunk))
}
close(bamFile)
In the paradigm above, we process one BAM file chunk-wise and report the sizes of the chunks, i.e. these would be 0.1M reads - in our case - except for the last one, which would be smaller than or equal to 0.1M (it is unlikely that a sequencing file contains an exact multiple of our chunk size).
Now, try to combine the above paradigm with the function we implemented previously - remember the mclapply solution - so as to process both our BAM files in parallel and chunk-wise.
gAlns <- mclapply(bamFileList,function(bamFile){
open(bamFile)
gAln <- GAlignments()
while(length(chunk <- readGAlignments(bamFile))){
gAln <- c(gAln,chunk)
}
close(bamFile)
return(gAln)
})
Before reading my comments below, take the time to jot down what you think are the advantages and drawbacks of the method implemented above. My own comments below are certainly not extensive and I would be curious to hear yours that are not matched with mine.
We have written a streamlined piece of code, using up to date functionalities from other packages. Hence, it is both easily maintainable and updatable.
With regards to time consumption, we have reduced it by a factor of 2, and it can be reduced further by using a computer with more CPUs, or even a compute farm - obviously only if we have more than 2 samples to process.
We have implemented the processing of the BAM files by chunk, keeping the memory footprint low and fairly constant.
In the former chapter, we have had an initial look at computer optimisation. In the present chapter, we will look into how read alignments are commonly manipulated and what are the principles followed (and caveats faced) by standard tools.
Most down-stream analysis of short read sequences is based on reads aligned to reference genomes. There are many aligners available, including BWA (H. Li and Durbin 2010),(H. Li and Durbin 2009), bowtie (Langmead et al. 2009), GSNAP (Wu and Watanabe 2005), STAR (Dobin et al. 2013); their merits are discussed in the literature. There are also alignment algorithms implemented in Bioconductor (e.g., matchPDict in the Biostrings package and the gmapR, Rbowtie, Rsubread packages); matchPDict is particularly useful for flexible alignment of moderately sized subsets of data.
The following sections introduce core tools for working with high-throughput sequence data; key packages for representing reads and alignments are summarized in the following table.
| Package | Description |
|---|---|
| ShortRead | In addition to functionalities to manipulate raw read files (e.g. the ShortReadQ class and functions for manipulating FASTQ files), this package offers the possibility to load numerous HTS format classes. These are mostly sequencer-manufacturer specific (e.g. sff for 454) or pre-BAM aligner proprietary formats (e.g. MAQ or bowtie). These functionalities rely heavily on Biostrings and somewhat on Rsamtools. |
| GenomicAlignments | GAlignments and GAlignmentPairs store single- and paired-end aligned reads and extend the GenomicRanges functionalities. |
| Rsamtools | Provides access to BAM alignment and other large sequence-related files. |
| rtracklayer | Input and output of bed, wig and similar files. |
Read the man page of the GAlignments and GAlignmentPairs classes and pay attention to the very important comments on multi-reads and paired-end processing.
Really, just ?GAlignments. However, KEEP these details in mind as they are essential and a likely source of erroneous conclusions. Remember the example from this morning’s lecture about RNA editing.
ShortRead package

Earlier, you looked at the ShortRead package to manipulate raw reads and to perform Quality Assessment (QA) on raw data files, e.g. FASTQ formatted files. These are not the only functionalities of the ShortRead package, which also offers the possibility to read alignment files in many different formats.
Two files of the pasilla dataset have been aligned using bowtie (???); locate them in the extdata folder of the RnaSeqTutorial package.
bwtFiles <- dir(path=file.path(extdata(),"bowtie"),
pattern="*.bwt$",full.names=TRUE)
Take a peek at one of the files and try to decipher its format. Hint: it is a tab-delimited format, so check the read.delim function. As you may not want to read all the lines just to get an idea, look up an appropriate argument for that.
You might want to check the bowtie manual to verify whether your guesses were correct. Here is how to read 10 lines of the first file.
read.delim(file=file.path(extdata(),"bowtie","SRR074430.bwt"),
header=FALSE,nrows=10)
Now, as was presented in the lecture, use the readAligned function to read in the bowtie alignment files.
alignedRead <- readAligned(dirPath=file.path(extdata(),"bowtie"),
pattern="*.bwt$",type="Bowtie")
What is peculiar about the returned object? Determine its class. Can you tell where the data from both input files are?
We obtained a single object of the AlignedRead class. By looking at the documentation (the Value section of ?readAligned), we are told that all files are concatenated into a single object with NO guarantee of the order in which the files are read. This is convenient when we want to merge several sequencing runs of the same sample, but we need to be cautious and process independent samples by calling the readAligned function individually for every sample.
Finally, taking another look at the lecture, select only the reads that align to chromosome 2L. Hint: use the appropriate SRFilter filter.
alignedRead2L <- readAligned(dirPath=file.path(extdata(),"bowtie"),
pattern="*.bwt$",type="Bowtie",
filter=chromosomeFilter("2L"))
This concludes the overview of the ShortRead package. As the BAM format has become a de facto standard, it is unlikely that you will end up using that package to process reads in R, but rather the Rsamtools package presented next.
Rsamtools package

Most main-stream aligners produce output in SAM (text-based) or BAM format. A SAM file is a text file with one line per aligned read and fields separated by tabs. Here is an example of a single SAM line, split into fields.
fl <- system.file("extdata", "ex1.sam", package="Rsamtools")
strsplit(readLines(fl, 1), "\t")[[1]]
## [1] "B7_591:4:96:693:509"
## [2] "73"
## [3] "seq1"
## [4] "1"
## [5] "99"
## [6] "36M"
## [7] "*"
## [8] "0"
## [9] "0"
## [10] "CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG"
## [11] "<<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7"
## [12] "MF:i:18"
## [13] "Aq:i:73"
## [14] "NM:i:0"
## [15] "UQ:i:0"
## [16] "H0:i:1"
## [17] "H1:i:0"
Fields in a SAM file are summarized in the following table:
| Field | Name | Value |
|---|---|---|
| 1 | QNAME | Query (read) NAME |
| 2 | FLAG | Bitwise FLAG, e.g., strand of alignment |
| 3 | RNAME | Reference sequence NAME |
| 4 | POS | 1-based leftmost POSition of sequence |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIGAR | Extended CIGAR string |
| 7 | MRNM | Mate Reference sequence NaMe |
| 8 | MPOS | 1-based Mate POSition |
| 9 | ISIZE | Inferred insert SIZE |
| 10 | SEQ | Query SEQuence on the reference strand |
| 11 | QUAL | Query QUALity |
| 12+ | OPT | OPTional fields, format TAG:VTYPE:VALUE |
We recognize, from the FASTQ format introduced earlier, the identifier string, read sequence and qualities. The alignment is to a chromosome ‘seq1’ starting at position 1. The strand of alignment is encoded in the ‘FLAG’ field. The alignment record also includes a measure of mapping quality and a CIGAR string describing the nature of the alignment. In this case, the CIGAR is 36M, indicating that the alignment consists of 36 Matches or mismatches, with no indels or gaps; indels are represented by I and D, gaps (e.g. from alignments spanning introns) by N.
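The FLAG field is a bit mask defined in the SAM specification; the value 73 in the record above decomposes as 1 + 8 + 64. A small shell sketch decoding the most common bits:

```shell
decode_flag() {
    flag=$1
    # bit meanings as per the SAM specification
    if [ $(( flag & 1 ))   -ne 0 ]; then echo "read paired";         fi
    if [ $(( flag & 4 ))   -ne 0 ]; then echo "read unmapped";       fi
    if [ $(( flag & 8 ))   -ne 0 ]; then echo "mate unmapped";       fi
    if [ $(( flag & 16 ))  -ne 0 ]; then echo "read reverse strand"; fi
    if [ $(( flag & 64 ))  -ne 0 ]; then echo "first in pair";       fi
    if [ $(( flag & 128 )) -ne 0 ]; then echo "second in pair";      fi
}
```

For the record above, `decode_flag 73` reports the read as paired, with an unmapped mate, and first in its pair.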
BAM files encode the same information as SAM files, but in a format that is more efficiently parsed by software; BAM files are the primary way in which aligned reads are imported into R.
As introduced in the previous section, there are three different packages to read alignments in R:
ShortRead
GenomicAlignments
Rsamtools
The last two will be described in more details in the next paragraphs.
The readGAlignments function from the GenomicAlignments package reads essential information from a BAM file into an instance of the GAlignments class. The GAlignments class has been designed to allow useful manipulation of many reads (e.g., 20 million) under moderate memory requirements (e.g., 4 GB).
Use the readGAlignments to read in the “ex1.bam” that can be found in the “extdata” folder of the Rsamtools package.
alnFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln <- readGAlignments(alnFile)
head(aln, 3)
The readGAlignments function takes an additional argument, param, that allows streamlining the information retrieved from a BAM file, e.g. to retrieve alignments to known gene coordinates only.
A GAlignments instance is like a data frame, but with accessors as suggested by the column names. It is easy to query, e.g., the distribution of reads aligning to each strand, the length of reads (a.k.a. width) or the alignment CIGAR strings.
Summarize the strand, width and CIGAR information from the aln object.
table(strand(aln))
##
## + - *
## 1647 1624 0
table(width(aln))
##
## 30 31 32 33 34 35 36 38 40
## 2 21 1 8 37 2804 285 1 112
head(sort(table(cigar(aln)), decreasing=TRUE))
##
## 35M 36M 40M 34M 33M 14M4I17M
## 2804 283 112 37 6 4
The readGAlignments function of the GenomicAlignments package only parses some of the fields of a BAM file, and that may not be appropriate for every usage. In such cases, the scanBam function in Rsamtools provides greater flexibility. The idea is to view BAM files as a kind of database. Particular regions of interest can be selected, and the information in the selection restricted to particular fields. These operations are determined by the values of a ScanBamParam object, passed as the named param argument to scanBam.
Consult the help page for ScanBamParam, and construct an object that restricts the information returned by a scanBam query to the aligned read DNA sequence. Your solution will use the what parameter of the ScanBamParam function.
Use the ScanBamParam object to query a BAM file, and calculate the GC content - using the gcFunction function - of all aligned reads. Summarize the GC content as a histogram:
param <- ScanBamParam(what="seq")
seqs <- scanBam(bamfiles[[1]], param=param)
readGC <- gcFunction(seqs[[1]][["seq"]])
hist(readGC)
The Rsamtools package has more advanced functionalities:
functions to count, index, filter, sort BAM files
functions to access the header only
the possibility to access SAM attributes (tags)
functions to manipulate the CIGAR string
the possibility to combine BAM libraries into a study set (BamViews)
…
Find the function that permits scanning the BAM header, and retrieve the header of the first BAM file in the bigdata() bam subfolder.
It contains the reference sequence lengths and names, as well as the name, version and command line of the tool used for performing the alignments.
scanBamHeader(bamfiles[1])
The SAM/BAM format contains a tag, “NH”, that defines the total number of valid alignments reported for a read. How can you extract that information from the same first BAM file and plot it as a histogram?
param <- ScanBamParam(tag="NH")
nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
hist(nhs)
So it seems a majority of our reads have multiple alignments! Some processing might be required to deal with these; i.e. if reads were aligned to the transcriptome, there exist tools that can deconvolute transcript-specific expression, for example MMSEQ (Turro et al. 2011) or BitSeq (Glaus et al. 2012), the latter existing as an R package too: BitSeq. Otherwise, if reads were aligned to the genome, one should consider filtering these multiple alignments to avoid introducing artifactual noise in the subsequent analyses.
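One simple way to perform such filtering outside R is to drop every record whose NH tag differs from 1. The sketch below is our own helper, not an official tool; it works on the text SAM stream, where optional tags start at field 12:

```shell
keep_unique() {
    awk -F '\t' '
        /^@/ { print; next }                 # keep header lines
        { for (i = 12; i <= NF; i++)         # scan the optional tags
              if ($i == "NH:i:1") { print; next } }'
}
# samtools view -h aln.bam | keep_unique | samtools view -b -o aln.uniq.bam -
```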
The CIGAR string contains interesting information; in particular, whether or not a given match contains indels. Using the first BAM file, can you get a matrix of all seven CIGAR operations? And find out the intron size distribution?
param <- ScanBamParam(what="cigar")
cigars <- scanBam(bamfiles[[1]], param=param)[[1]]$cigar
cigar.matrix <- cigarOpTable(cigars)
intron.size <- cigar.matrix[,"N"]
intron.size[intron.size>0]
plot(density(intron.size[intron.size>0]))
hist(log10(intron.size[intron.size>0]),xlab="intron size (log10 bp)")
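The same kind of CIGAR arithmetic can be done outside R. A minimal awk sketch (our own helper) that sums the reference-consuming operations (M, D, N, = and X), so that an intron-spanning CIGAR such as 10M100N10M yields a reference span of 120:

```shell
cigar_ref_span() {
    echo "$1" | awk '{
        n = 0; len = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c ~ /[0-9]/) { len = len c }       # accumulate the length digits
            else {
                if (c ~ /[MDN=X]/) n += len        # M/D/N/=/X consume reference
                len = ""                           # I/S/H/P do not
            }
        }
        print n
    }'
}
```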
Look up the documentation for BamViews and, using the leeBamViews package, create a BamViews instance. Afterwards, use some of the accessors of that object to retrieve, e.g., the file paths or the sample names.
library(leeBamViews)
bpaths = dir(system.file("bam", package="leeBamViews"), full=TRUE, patt="bam$")
gt<-do.call(rbind,strsplit(basename(bpaths),"_"))[,1]
geno<-substr(gt,1,nchar(gt)-1)
lane<-substr(gt,nchar(gt),nchar(gt))
pd = DataFrame(geno=geno, lane=lane, row.names=paste(geno,lane,sep="."))
bs1 = BamViews(bamPaths=bpaths, bamSamples=pd,
bamExperiment=list(annotation="org.Sc.sgd.db"))
bamPaths(bs1)
bamSamples(bs1)
Finally, extract the coverage for the locus 861250:863000 on chromosome “Scchr13” for every sample in the object
sel <- GRanges(seqnames = "Scchr13",
IRanges(start = 861250, end = 863000),
strand="+")
covex = RleList(lapply(bamPaths(bs1),
function(x){
coverage(readGAlignments(x))[sel][["Scchr13"]]}))
This offers an interesting way to process multiple samples at the same time when you are interested in a particular locus.
There are extensive vignettes for the Rsamtools and GenomicAlignments packages.
In the previous chapter, we saw how aligned reads can be manipulated and what the crucial caveats are, i.e. stranded vs. non-stranded data and the discarding of multi-mapping reads. In the present chapter, we will look at how genic annotations are retrieved and processed. Here again we will introduce how this is commonly done and which caveats this implies.
Bioconductor provides extensive annotation resources, summarized in the following figure. These can be gene-, or genome-centric. Annotations can be provided in packages curated by Bioconductor, or obtained from web-based resources.
Gene-centric AnnotationDbi packages include:
Organism level: e.g. org.Mm.eg.db, Homo.sapiens.
Platform level: e.g. hgu133plus2.db, hgu133plus2.probes, hgu133plus2.cdf.
Homology level: e.g. hom.Dm.inp.db.
System biology level: GO.db, KEGG.db, Reactome.db.
Examples of genome-centric packages include:
GenomicFeatures, to represent genomic features, including constructing reproducible feature or transcript databases from local files or web resources.
Pre-built transcriptome packages, e.g. TxDb.Hsapiens.UCSC.hg19.knownGene based on the UCSC hg19 knownGenes track.
BSgenome for whole genome sequence representation and manipulation.
Pre-built genomes, e.g., BSgenome.Hsapiens.UCSC.hg19 based on the UCSC hg19 build.
Web-based resources include
biomaRt to query biomart resources for genes, sequences, SNPs, etc.
rtracklayer for interfacing with browser tracks, especially the UCSC genome browser.
Genome projects often provide gff3 (General Feature Format, version 3) or gtf (Gene Transfer Format) formatted files as annotation sources, which can be manipulated with the rtracklayer or genomeIntervals packages.
Annotation Packages: the big picture
Here, we will focus on the genome-centric approaches, using the GenomicFeatures package.
GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic coordinates. It is straightforward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve the corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.
Load the TxDb package relevant to the dm3 build of D. melanogaster.
Use select and friends to retrieve, for the Flybase gene ids of the top table, the Flybase transcript names (TXNAME) and Entrez gene identifiers (GENEID).
Use cdsBy to extract all coding sequences, grouped by transcript.
Subset the coding sequences to contain just the transcripts relevant to the top table.
How many transcripts are there?
What is the structure of the first transcript’s coding sequence?
Load the ‘BSgenome’ package for the dm3 build of D. melanogaster.
Use the coding sequence ranges from the previous part of this exercise to extract the underlying DNA sequence, using the extractTranscriptSeqs function.
Use Biostrings’s translate function to convert DNA to amino acid sequences.
The following loads the relevant TxDb package, and creates a more convenient alias to the TxDb instance defined in the package.
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
We can discover available keys (using keys) and columns (columns) in txdb, and then use select to retrieve the transcripts associated with each differentially expressed gene. The mapping between gene and transcript is not one-to-one: some genes have more than one transcript, which is biologically relevant, and some may have none, which might be due to different database versions being used or to database cross-reference (xref) inconsistencies.
set.seed(123)
fbids <- sample(keys(txdb), 10, FALSE)
txnm <- select(txdb, fbids, "TXNAME", "GENEID")
## 'select()' returned 1:many mapping between keys and columns
nrow(txnm)
## [1] 17
head(txnm, 3)
## GENEID TXNAME
## 1 FBgn0031701 FBtr0307036
## 2 FBgn0031701 FBtr0079073
## 3 FBgn0053819 FBtr0091823
TxDb instances can be queried for data that is more structured than simple data frames, in particular returning GRanges or GRangesList instances to represent genomic coordinates. These queries are performed using cdsBy (coding sequences), transcriptsBy (transcripts), etc., where the function argument by specifies how coding sequences or transcripts are grouped. Here we extract the coding sequences grouped by transcript, returning the transcript names, and subset the resulting GRangesList to contain just the transcript of interest to us. The sixth transcript is composed of 8 distinct coding sequence regions.
cds <- cdsBy(txdb, "tx", use.names=TRUE)[txnm$TXNAME[6]]
cds[1]
## GRangesList object of length 1:
## $FBtr0073547
## GRanges object with 8 ranges and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chrX [11332696, 11332927] - | 59110 <NA>
## [2] chrX [11331237, 11332272] - | 59109 <NA>
## [3] chrX [11331056, 11331170] - | 59108 <NA>
## [4] chrX [11330808, 11330825] - | 59107 <NA>
## [5] chrX [11329762, 11330429] - | 59106 <NA>
## [6] chrX [11329582, 11329699] - | 59105 <NA>
## [7] chrX [11329416, 11329514] - | 59104 <NA>
## [8] chrX [11327313, 11327450] - | 59102 <NA>
## exon_rank
## <integer>
## [1] 2
## [2] 3
## [3] 4
## [4] 5
## [5] 6
## [6] 7
## [7] 8
## [8] 9
##
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome
The following code loads the appropriate BSgenome package; the Dmelanogaster object defined therein refers to the whole genome sequence represented in this package. The remaining steps extract the DNA sequence of each transcript and translate these to amino acid sequences. Issues of strand are handled correctly.
library(BSgenome.Dmelanogaster.UCSC.dm3)
txx <- extractTranscriptSeqs(Dmelanogaster, cds)
length(txx)
## [1] 1
head(txx, 3)
## A DNAStringSet instance of length 1
## width seq names
## [1] 2424 ATGACCCTGACCACAACGACC...GTACGGAGCTCTCCACATAA FBtr0073547
head(translate(txx), 3)
## A AAStringSet instance of length 1
## width seq names
## [1] 808 MTLTTTTTASSAESQAKMDVK...EIDFNDLNMVRRGSTELST* FBtr0073547
One major caveat when estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, can align equally well to several features (transcripts or genes). This happens as a result of gene duplication and of the presence of repetitive or common domains, for example. To avoid this, it is best practice to adopt a conservative approach and collapse all existing transcripts of a single gene locus into a “synthetic” transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript-gene structure with a one-to-one relationship. As this procedure varies from organism to organism, there is, to the best of our knowledge, no dedicated tool available for performing this step.
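To make the collapsing concrete, here is a minimal base-R sketch (toy exon coordinates, no Bioconductor dependencies) of the interval merging that IRanges::reduce performs in the real pipeline: overlapping or adjacent exons from different isoforms of one gene are merged into non-redundant "synthetic" exons.

```r
## Toy exons from two isoforms of one hypothetical gene (made-up coordinates)
exons <- data.frame(
  start = c(100, 300, 100, 250, 600),
  end   = c(200, 400, 180, 350, 700)
)
## sort by start, then merge every exon that overlaps (or abuts) the previous one
exons <- exons[order(exons$start), ]
merged <- exons[1, ]
for (i in seq_len(nrow(exons))[-1]) {
  last <- nrow(merged)
  if (exons$start[i] <= merged$end[last] + 1) {
    merged$end[last] <- max(merged$end[last], exons$end[i])
  } else {
    merged <- rbind(merged, exons[i, ])
  }
}
## the five isoform exons collapse into three synthetic exons:
## [100,200], [250,400], [600,700]
merged
```

The merged intervals are exactly the per-gene "synthetic" exons described above; IRanges::reduce does this efficiently for whole genomes.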
Here, we will take inspiration (ha-ha) from the documentation of the R/Bioconductor easyRNASeq package (Delhomme et al. 2012); see paragraph 7.1 of the package vignette.
First, we load the necessary libraries
library(IRanges)
library(genomeIntervals)
Next, we read in the GFF3 file
gff <- readGff3(file.path(extdata(),
"GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
quiet=TRUE)
And have a look at the Genome_Intervals object content
gff
## Object of class Genome_intervals_stranded
## 1150406 base intervals and 0 inter-base intervals(*):
## Chr01 - [1660, 2502]
## Chr01 - [1660, 2502]
## Chr01 - [1660, 2502]
## Chr01 - [1660, 2502]
## ... 1150398 other intervals...
## scaffold_991 + [13343, 13510]
## scaffold_991 + [13343, 13510]
## scaffold_991 + [13881, 14048]
## scaffold_991 + [13881, 14048]
##
## annotation:
## seq_name strand inter_base source type score phase
## 1 Chr01 - FALSE phytozome9_0 gene NA NA
## 2 Chr01 - FALSE phytozome9_0 mRNA NA NA
## 3 Chr01 - FALSE phytozome9_0 exon NA NA
## 4 Chr01 - FALSE phytozome9_0 CDS NA 0
## gffAttributes
## 1 ID=Potri.001G000100;Name=Potri.001G000100
## 2 ID=PAC:27043735;Name=Potri.001G000100.1;pacid=27043735;longest=1;Parent=Potri.001G000100
## 3 ID=PAC:27043735.exon.1;Parent=PAC:27043735;pacid=27043735
## 4 ID=PAC:27043735.CDS.1;Parent=PAC:27043735;pacid=27043735
## ... 1150398 other intervals...
## seq_name strand inter_base source type score phase
## 1150403 scaffold_991 + FALSE phytozome9_0 exon NA NA
## 1150404 scaffold_991 + FALSE phytozome9_0 CDS NA 0
## 1150405 scaffold_991 + FALSE phytozome9_0 exon NA NA
## 1150406 scaffold_991 + FALSE phytozome9_0 CDS NA 0
## gffAttributes
## 1150403 ID=PAC:26993235.exon.4;Parent=PAC:26993235;pacid=26993235
## 1150404 ID=PAC:26993235.CDS.4;Parent=PAC:26993235;pacid=26993235
## 1150405 ID=PAC:26993235.exon.5;Parent=PAC:26993235;pacid=26993235
## 1150406 ID=PAC:26993235.CDS.5;Parent=PAC:26993235;pacid=26993235
nrow(gff[gff$type=="exon",])
## [1] 460072
nrow(gff[gff$type=="mRNA",])
## [1] 73013
nrow(gff[gff$type=="gene",])
## [1] 41335
Here, we do not ask you to understand every detail of the implementation, but rather to get an idea of the process.
First, we identify the ID and Parent attributes of all mRNA features and create a mapping table from them.
sel <- gff$type == "mRNA"
transcriptGeneMapping <- data.frame(getGffAttribute(gff[sel], "ID"),
getGffAttribute(gff[sel], "Parent")
)
head(transcriptGeneMapping)
## ID Parent
## 1 PAC:27043735 Potri.001G000100
## 2 PAC:27045395 Potri.001G000200
## 3 PAC:27045442 Potri.001G000300
## 4 PAC:27046907 Potri.001G000400
## 5 PAC:27046908 Potri.001G000400
## 6 PAC:27046909 Potri.001G000400
Next, we select the exon features and split them into groups by their parent gene.
sel <- gff$type=="exon"
rngList<- split(IRanges(start=gff[sel,1],end=gff[sel,2]),
transcriptGeneMapping[match(
sapply(strsplit(getGffAttribute(gff[sel,],"Parent"),","),"[",1),
transcriptGeneMapping$ID),"Parent"])
rngList
## IRangesList of length 41335
## $Potri.001G000100
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1660 2502 843
##
## $Potri.001G000200
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 6501 6646 146
## [2] 3506 3928 423
## [3] 2906 3475 570
##
## $Potri.001G000300
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 8391 8860 470
##
## ...
## <41332 more elements>
mostExons <- rev(names(table(elementNROWS(rngList))))[1]
mostExons
## [1] "659"
Then, we reduce the exon structure
rngList<- IRanges::reduce(rngList)
rngList
## IRangesList of length 41335
## $Potri.001G000100
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 1660 2502 843
##
## $Potri.001G000200
## IRanges object with 3 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 2906 3475 570
## [2] 3506 3928 423
## [3] 6501 6646 146
##
## $Potri.001G000300
## IRanges object with 1 range and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 8391 8860 470
##
## ...
## <41332 more elements>
rev(names(table(elementNROWS(rngList))))[1]
## [1] "77"
Once we have reduced the transcript complexity, we can reconstruct a GFF structure from it.
First, we get the gff information: here we simply duplicate the first exon of every gene as many times as there are synthetic exons in that gene. The content will be updated afterwards.
exons <- gff[sel,]
syntheticGeneModel<- exons[rep(
match(names(rngList),
transcriptGeneMapping[
match(sapply(strsplit(getGffAttribute(exons,"Parent"),","),"[",1),
transcriptGeneMapping$ID),"Parent"]),
elementNROWS(rngList)),]
Then, we update the coordinates and change the source. The latter step documents our changes, i.e. it makes it obvious to anyone using the newly generated annotations that they are not the original ones.
syntheticGeneModel[,1]<- unlist(start(rngList))
syntheticGeneModel[,2]<- unlist(end(rngList))
levels(syntheticGeneModel$source)<- "inhouse"
Next, we get the exon number for both strands. As our first command orders the exons correctly on the minus strand - i.e. in decreasing order - we have to reverse those on the plus strand.
exonNumber<- lapply(elementNROWS(rngList),":",1)
sel<- strand(syntheticGeneModel)[cumsum(elementNROWS(rngList))] == "+"
exonNumber[sel]<- sapply(exonNumber[sel],rev)
This is followed by updating the attributes (the ninth column).
syntheticGeneModel$gffAttributes<- paste("ID=",
rep(names(rngList),elementNROWS(rngList)),
":",unlist(exonNumber),";Parent=",
rep(names(rngList),elementNROWS(rngList)),".0",sep="")
Finally, we write the newly generated annotation out in different formats.
writeGff3(syntheticGeneModel,file="~/Ptrichocarpa_v3.0_210_synthetic_transcripts.gff3")
sel <- syntheticGeneModel$type=="exon"
annot <- split(GRanges(seqnames=seq_name(syntheticGeneModel[sel]),
ranges=IRanges(start=syntheticGeneModel[sel,1],
end=syntheticGeneModel[sel,2]),
strand=strand(syntheticGeneModel[sel])),
getGffAttribute(syntheticGeneModel,"Parent")[sel,1]
)
## Warning: 'seq_name' is deprecated.
## Use 'seqnames' instead.
## See help("Deprecated")
save(annot,file="~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda")
Note: the same can actually be achieved with a single call to the easyRNASeq createSyntheticTranscripts function
synthTrx <- createSyntheticTranscripts(
file.path(extdata(),"GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
verbose=FALSE)
In this chapter, we will briefly return to computational optimisation.
library(RnaSeqTutorial)
bamfiles <- dir(file.path(extdata(),"BAM"),
pattern="*.bam$",full.names=TRUE)
bamFileList <- BamFileList(bamfiles,yieldSize=10^6)
As you have been introduced to the GenomicAlignments functionalities to find, count, or summarize overlaps between reads and annotations, we can refine our preferred function. We had left it as:
gAlns <- mclapply(bamFileList,function(bamFile){
open(bamFile)
gAln <- GAlignments()
while(length(chunk <- readGAlignments(bamFile))){
gAln <- c(gAln,chunk)
}
close(bamFile)
return(gAln)
})
Using our synthetic transcript annotation, Ptrichocarpa_v3.0_210_synthetic_transcripts.rda, implement counting by chunks.
data("Ptrichocarpa_v3.0_210_synthetic_transcripts")
count.list <- mclapply(bamFileList,function(bamFile,annot){
open(bamFile)
counts <- vector(mode="integer",length=length(annot))
while(length(chunk <- readGAlignments(bamFile))){
counts <- counts + assays(summarizeOverlaps(annot,
chunk,
mode="Union",
ignore.strand=TRUE))$counts
}
close(bamFile)
return(counts)
},syntheticTrx[syntheticTrx$type == "exon"])
This gives us a list of counts per sample; to get a count matrix, do:
count.table <- do.call("cbind",count.list)
head(count.table[rowSums(count.table)>0,])
## reads reads
## [1,] 2 3
## [2,] 3 2
## [3,] 2 0
## [4,] 12 5
## [5,] 2 1
## [6,] 0 3
Such a count table object is the minimal input that downstream analysis packages - e.g. DESeq2 and edgeR - use.
A function similar to this is probably all you’ll need to process your reads and get a count table from a standard Illumina-based RNA-Seq experiment. However, you might want more flexibility for your projects, and Bioconductor certainly offers the possibility to achieve that; examples are given in the next chapter.
In this chapter, now that we are familiar with what feature summarisation means, we will run one of the de facto standard tools, namely HTSeq.
The read alignment step performed previously concluded the data pre-processing common to the majority of RNA-Seq based experiments. The following table details the typical number of usable reads observed after the data filtering and alignment steps. There is subsequently a large number of choices for performing downstream analyses of mRNA-Seq data. Probably the most common are to identify differential expression between conditions or to analyse sequence variants (Single Nucleotide Polymorphisms (SNPs), INsertions/DELetions (INDELs), Copy Number Variants (CNVs)). However, some of these analyses - DE analysis for example, the topic of the remainder of this tutorial - require additional data preparation.
| Step | Input Data | Usable reads | % of total | % removed from previous step |
|---|---|---|---|---|
| Raw | raw reads | 1,000,000 | 100 | 0 |
| SortMeRNA | raw reads | 970,000 - 990,000 | 97 - 99 | 1 - 3 |
| Trimmomatic | filtered / raw reads | 776,000 - 891,000 | 78 - 89 | 10 - 20 |
| Aligner (STAR) | trimmed / filtered / raw reads | 620,800 - 801,900 | 62 - 81 | 10 - 20 |
| Analysis | Aligned reads | 620,800 - 801,900 | 62 - 81 | 0 |
This data preparation varies depending on whether expression at the gene or the transcript level is required. In our case, we are interested in gene expression, as we would like to perform a differential gene expression study.
A typical DE analysis data preparation consists of three steps: first, generating a non-redundant annotation; second, quantifying/summarising the pre-processed reads aligned to each such feature; and finally, performing a QC that assesses whether the observed effects may have biological causes.
One major caveat when estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, can align equally well to several features (transcripts or genes). This happens as a result of gene duplication and of the presence of repetitive or common domains, for example. To avoid this, it is best practice to adopt a conservative approach and collapse all existing transcripts of a single gene locus into a “synthetic” transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript-gene structure with a one-to-one relationship. As this procedure varies from organism to organism, there is, to the best of our knowledge, no dedicated tool available for performing this step. The documentation of the R/Bioconductor easyRNASeq package (Delhomme et al. 2012) - see paragraph 7.1 of the package vignette - details a way of doing this in R starting from a GTF/GFF3 annotation file. From the “genome.gff3” that was used during the alignment step, we obtain a synthetic-transcript.gff3 file.
The second step is to perform the intersection between the aligned positions of reads (contained in the alignment BAM file) and the gene coordinates obtained in the previous step, to count the number of reads overlapping a gene. There are two primary caveats here: first, the annotation collapsing process detailed above works on a gene-by-gene basis and is hence oblivious to genes that overlap another gene encoded on the opposite strand. Second, aligners may return multiple mapping positions for a single read. In the absence of a more adequate solution - see the next section on “DE analysis at the transcript level” for an example of what may be done - it is best to ignore multi-mapping reads.
A de facto standard for counting is the htseq-count tool supplied as part of the HTSeq python library (Anders, Pyl, and Huber 2014). Its associated webpage illustrates the issues discussed above in greater detail. For non-strand-specific reads we suggest running htseq-count as follows:
htseq-count -f bam -r pos -m union -s no -t exon -i Parent \
sample-sortmerna-trimmomatic-STAR.bam synthetic-transcript.gff3 > \
sample-sortmerna-trimmomatic-STAR-HTSeq.txt
whereas for stranded data we advise using the following:
htseq-count -f bam -r pos -m intersection-nonempty -s reverse -t exon \
-i Parent sample-sortmerna-trimmomatic-STAR.bam synthetic-transcript.gff3 > \
sample-sortmerna-trimmomatic-STAR-HTSeq.txt
Using your STAR-aligned BAM file and the gff3 file available at ~/share/Day1/data/reference/gff/Ptrichocarpa_v3.0_210_synthetic-gene-models-wo-introns.gff3, use the HTSeq htseq-count utility to create the count table for your sample (our data is not strand-specific).
Before you proceed further, read about the counting modes in HTSeq and think about why we chose the union mode.
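To see why union is a reasonable default, here is a toy base-R sketch (made-up coordinates, not the HTSeq implementation) contrasting union-style counting, which keeps any read overlapping the exon, with strict containment, which keeps only reads lying entirely within it:

```r
## One hypothetical exon and four toy 50-100 bp reads
exon  <- c(start = 1000, end = 1200)
reads <- data.frame(start = c(950, 1010, 1150, 1300),
                    end   = c(1049, 1109, 1249, 1399))
## "union": the read overlaps the exon at all
overlaps <- reads$start <= exon["end"] & reads$end >= exon["start"]
## "strict": the read lies entirely within the exon
within   <- reads$start >= exon["start"] & reads$end <= exon["end"]
## union keeps 3 reads, strict only 1: reads hanging over the exon
## boundary (e.g. into a UTR or unannotated exon edge) are kept by union
c(union = sum(overlaps), strict = sum(within))
```

With imperfect annotations, reads frequently extend slightly beyond annotated exon boundaries, so a strict mode would silently discard signal; union retains it while still discarding reads ambiguous between two genes.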
This last chapter builds on what we have done in the five previous chapters and extends the use of the R/Bioconductor packages, which offer greater flexibility than tools such as HTSeq. Very few tools address the caveats we mentioned in the previous chapters, and this is still a field under very active development. The aim of this chapter is to give you some extra insight, i.e. pointers, in case your research is affected by one of the mentioned (or unmentioned) caveats.
library(RnaSeqTutorial)
bamfiles <- dir(file.path(extdata(),"BAM"),
pattern="*.bam$",full.names=TRUE)
This chapter describes part of an RNA-Seq analysis use case. RNA-Seq (Mortazavi and others 2008) was introduced as a new method to perform gene expression analysis, taking advantage of the high throughput of Next-Generation Sequencing (NGS) machines.
The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc.
To achieve this, we need to take advantage of all the steps performed previously in the workshop.
the alignment information has to be retrieved
the corresponding annotation needs to be fetched and possibly adapted, e.g. as was done in the preceding lecture
the read coverage per genic feature of interest has to be determined
Can you associate at least one Bioconductor package with each of these tasks?
There are numerous choices; as an example, we will use the following set of packages:
Rsamtools
genomeIntervals
GenomicAlignments
In this section we will import the data using the GenomicAlignments package readGAlignments function. This will create a GAlignments object that contains only the reads that aligned to the genome.
Using what was introduced in the alignments section, read in the first BAM file from the extdata() BAM folder. Remember that the protocol used was not strand-specific.
First we scan the bam directory:
bamfiles <- dir(file.path(extdata(), "BAM"), ".bam$", full=TRUE)
names(bamfiles) <- sub("_.*", "", basename(bamfiles))
Then we read the first file:
aln <- readGAlignments(bamfiles[1])
strand(aln) <- "*"
As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e.: to get a feel for the data - such reads could be ignored.
Filter out the multi-mapping reads. Think of the “NH” tag.
param <- ScanBamParam(tag="NH")
nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
aln <- aln[nhs==1,]
Now that we have the alignments and the synthetic transcript annotation - the one from the lecture, the same used in the Interlude - we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken with reads that overlap more than one feature (e.g. overlapping genes, isoforms), as they might erroneously be counted several times in different features. To deal with this, we can use any of the approaches summarised in Figure 1:
Overlap modes: Image from the HTSeq package developed by Simon Anders.
The GenomicAlignments package summarizeOverlaps function offers different possibilities to summarize reads per feature:
load("~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda")
annot <- syntheticTrx[syntheticTrx$type=="mRNA",]
counts1 <- summarizeOverlaps(annot, aln, mode="Union")
counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")
counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")
Create a data.frame or a matrix of the results above and figure out whether any differences can be observed, e.g. check for differences in the row standard deviation (using the apply and sd functions).
synthTrxCountsTable <- data.frame(
assays(counts1)$counts,
assays(counts2)$counts,
assays(counts3)$counts)
colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")
rownames(synthTrxCountsTable) <- rownames(counts1)
sds <- apply(synthTrxCountsTable,1,sd)
sum(sds!=0)
sum(sds!=0)/length(sds)
synthTrxCountsTable[which.max(sds),]
syntheticTrx[which.max(sds),]
So it appears that we have 3,176 cases where the counts differ (8% of the total!), and that the synthetic transcript “Potri.001G244700.0” shows the largest difference.
For a detailed analysis, it would be important to adequately choose one of the intersection modes above; for the remainder of this section, however, we will use the “union” set. As before for reads aligning to multiple places in the genome, taking the union when reads overlap several features is a simplification we may not want to make. There are several methods that probabilistically estimate the expression of overlapping features: RSEM (B. Li et al. 2010), cufflinks (C. Trapnell et al. 2010), MMSEQ (Turro et al. 2011).
This concludes the section on counting reads per known feature. In the next section, we will look at how novel transcribed regions can be identified.
Note: This section is not essential to the workshop
One main advantage of RNA-Seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to reconstruct and annotate transcripts, e.g. Cufflinks (C. Trapnell et al. 2010), Oases (Schulz et al. 2012), Trinity (Grabherr et al. 2011), to mention a few. We can also use Bioconductor tools to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one can do in a few commands, i.e. R rules.
To start with and to make the results more precise, the reads have been realigned using STAR (Dobin et al. 2013), a very fast and accurate aligner that uses the recent approach of Maximum Exact Matches (MEMs); see this website for more details. This MEM approach - actually a derivative called MMP (Maximum Mappable Prefix) - allows STAR to identify exon-exon junctions without prior knowledge, i.e. with no need for an annotation gff. First, we re-read one of the sample alignments using the GenomicAlignments package readGAlignments function.
aln <- readGAlignments(
BamFile(file.path(extdata(),"BAM","207_subset_sortmerna_trimmomatic_sorted.bam")))
The process begins with calculating the coverage, using the method from the GenomicAlignments package:
cover <- coverage(aln)
cover
# this object is compressed to save space: it is run-length encoded (Rle)
# we can look at a section of chromosome 19 (a region of about 8 kbp),
# which gives us the number of reads overlapping each of those bases
as.vector(cover[["Chr19"]])[6553903:6561936]
The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosomes.
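As a minimal illustration of what coverage computes, here is a base-R sketch with toy reads on a hypothetical 50 bp chromosome; the rle call at the end shows why run-length encoding compresses this kind of vector so well:

```r
## Toy reads (made-up coordinates) on a 50 bp "chromosome"
chrLen <- 50
reads  <- data.frame(start = c(5, 10, 10, 40), end = c(14, 19, 24, 49))
## per-base coverage: how many reads overlap each base
cover <- integer(chrLen)
for (i in seq_len(nrow(reads))) {
  idx <- reads$start[i]:reads$end[i]
  cover[idx] <- cover[idx] + 1
}
cover[5:25]   # the covered region: values rise to 3 where three reads stack
rle(cover)    # the same information, compressed as runs of identical values
```

The coverage function of GenomicAlignments does exactly this, chromosome by chromosome, returning the run-length encoded form directly.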
The next step is to define “islands” of expression. These can be created using the slice function. The peak height of the islands can be determined with the viewMaxs function and the island widths can be found using the width function:
islands <- slice(cover, 1)
islandPeakHeight <- viewMaxs(islands)
islandWidth <- width(islands)
While more sophisticated approaches can be used to find exons de novo, we use a simple approach whereby we select islands whose maximum peak height is 2 or more and whose width is 150 bp (150% of the read size) or more to be candidate exons. The elementNROWS function shows how many of these candidate exons appear on each chromosome:
candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=150L]
candidateExons[["Chr19"]]
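The island logic itself is simple; this base-R sketch (toy coverage vector, not the IRanges implementation) mimics what slice, viewMaxs, and width return:

```r
## Toy coverage vector; threshold of 1 as in slice(cover, 1)
cover   <- c(0, 0, 1, 2, 3, 3, 1, 0, 0, 2, 2, 0, 1)
r       <- rle(cover >= 1)                 # runs above/below the threshold
ends    <- cumsum(r$lengths)
starts  <- ends - r$lengths + 1
## keep only the runs above threshold: these are the "islands"
islands <- data.frame(start = starts[r$values], end = ends[r$values])
islands$peak  <- mapply(function(s, e) max(cover[s:e]),
                        islands$start, islands$end)   # cf. viewMaxs
islands$width <- islands$end - islands$start + 1      # cf. width
## three islands: positions 3-7 (peak 3), 10-11 (peak 2), 13 (peak 1)
islands
```

Filtering this data.frame on peak and width is the base-R analogue of the candidateExons selection above.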
Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.
sum(cigarOpTable(cigar(aln))[,"N"] > 0)
## [1] 473809
There are 473,809 reads that span exon-exon junctions (EEJs).
Let’s look up such a potential EEJ:
aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
## GAlignments object with 13782 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] Chr19 - 61M311N40M 101 90848 91259
## [2] Chr19 + 58M16870N31M 89 119801 136759
## [3] Chr19 + 88M997N13M 101 131190 132287
## [4] Chr19 - 64M997N37M 101 131214 132311
## [5] Chr19 + 31M997N68M 99 131247 132342
## ... ... ... ... ... ... ...
## [13778] Chr19 + 49M433N41M 90 15847572 15848094
## [13779] Chr19 - 8S75M881N18M 101 15848054 15849027
## [13780] Chr19 - 60M1114N41M 101 15848069 15849283
## [13781] Chr19 - 18M256N83M 101 15849314 15849670
## [13782] Chr19 - 91M169N10M 101 15849618 15849887
## width njunc
## <integer> <integer>
## [1] 412 1
## [2] 16959 1
## [3] 1098 1
## [4] 1098 1
## [5] 1096 1
## ... ... ...
## [13778] 523 1
## [13779] 974 1
## [13780] 1215 1
## [13781] 357 1
## [13782] 270 1
## -------
## seqinfo: 1446 sequences from an unspecified genome
aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",][3:6,]
## GAlignments object with 4 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] Chr19 + 88M997N13M 101 131190 132287 1098
## [2] Chr19 - 64M997N37M 101 131214 132311 1098
## [3] Chr19 + 31M997N68M 99 131247 132342 1096
## [4] Chr19 - 10M997N90M1S 101 131268 132364 1097
## njunc
## <integer>
## [1] 1
## [2] 1
## [3] 1
## [4] 1
## -------
## seqinfo: 1446 sequences from an unspecified genome
There are 4 reads spanning what looks like an intron of 997 bp. Note that the GenomicAlignments package GAlignments class is aware of splicing junctions. Have a look at the coverage over that intron:
cover[["Chr19"]][131278:132275]
## integer-Rle of length 998 with 8 runs
## Lengths: 2 13 473 1 89 418 1 1
## Values : 2 1 0 1 2 0 1 5
Now, let us see whether we can identify a specific motif around the EEJs. We cherry-pick - almost at random, really - 2 EEJs with 7 supporting reads each. Then, using their CIGAR strings, we calculate the positions of the putative donor and acceptor sites.
splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
cherry.pick.sel <- c(13492:13498,13511:13517)
read.start <- start(splice.reads)[cherry.pick.sel]
donor.pos <- read.start - 1 +
as.numeric(sapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M"),"[",1))
acceptor.pos <- read.start - 1 +
sapply(
lapply(
lapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M|N"),"[",1:2),
as.integer),
sum)
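The CIGAR arithmetic above can be checked by hand. Here is a base-R sketch on the first spliced read shown earlier (cigar 61M311N40M, alignment start 90848): the donor position is the last aligned base before the N gap, and the acceptor position the last base of the gap.

```r
## One spliced alignment from the earlier listing
cigar      <- "61M311N40M"
read.start <- 90848
## split the CIGAR on the M and N operators: c(61, 311, 40)
ops <- as.integer(strsplit(cigar, "[MN]")[[1]])
donor.pos    <- read.start - 1 + ops[1]            # last exonic base
acceptor.pos <- read.start - 1 + ops[1] + ops[2]   # last intronic base
c(donor = donor.pos, acceptor = acceptor.pos, intron.size = ops[2])
```

Note that this simple split only works for M/N-only CIGARs; the code above uses sapply/lapply over the operator pairs for the same reason. As a sanity check, read.start + sum(ops) - 1 recovers the alignment end, 91259, reported earlier.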
Then we load the chromosome Chr19 sequence, so that we can determine the sequence in the vicinity of the EEJs.
Chr19 <- readDNAStringSet(file.path(extdata(),
"FASTA",
"Ptrichocarpa_v3.0_210-Chr19.fa.gz"))
And retrieve the sequences spanning the EEJs, padding each site by roughly 10 bp on both sides (20 bp windows).
donor <- Views(subject=Chr19[[1]],
start=donor.pos-8,
end=donor.pos+11)
acceptor <- Views(subject=Chr19[[1]],
start=acceptor.pos-10,
end=acceptor.pos+9)
Oops, isn’t something missing from our reasoning? Yes, indeed, we have not considered the strand until now! A visual evaluation told us that the first mRNA is probably on the minus strand whereas the second is on the plus strand; we consequently correct the sequences created above.
donor <- DNAStringSet(donor)
minus.acceptor <- reverseComplement(donor[1:7])
acceptor <- DNAStringSet(acceptor)
minus.donor <- reverseComplement(acceptor[1:7])
donor[1:7] <- minus.donor
acceptor[1:7] <- minus.acceptor
Let's see if there's a consensus in the sequences of 20bp centered around the potential acceptor and donor sites.
library(seqLogo)
pwm <- makePWM(cbind(
alphabetByCycle(donor)[c("A","C","G","T"),]/length(donor),
alphabetByCycle(acceptor)[c("A","C","G","T"),]/length(acceptor))
)
seqLogo(pwm)
GC content in aligned reads
Clearly, the logo in the figure above is not exceptional, but from only 2 EEJs we can already see that the donor site at positions 10-11 is GT and the acceptor site at positions 30-31 is AG, i.e. the canonical sites. Moreover, we can see a relative enrichment - or at least I want to see it, because I know it must be there - for Ts in the intron sequence, a known phenomenon. Hence, a de novo approach complemented by additional criteria can prove very efficient.
This concludes the section on summarizing counts. As you have realized, juggling with the different packages for manipulating alignments and annotations requires some coding. To facilitate this, a number of "workflow" packages are available on Bioconductor. The next section gives a brief introduction to easyRNASeq (obviously, a biased selection).
Let us redo what was done in the previous section using the easyRNASeq package.
Load the easyRNASeq package and its companion data package: RnaSeqTutorial.
Look up the help page of the simpleRNASeq function, or use show to look up the function definition and its arguments.
The data package contains 4 BAM files, located as indicated below. Create a BamFileList object from them.
The data package also contains a GFF3 formatted file - given below. Create an AnnotParam object from it.
Knowing that the RNA-Seq protocol was single-end and unstranded, create the appropriate BamParam object.
Finally, combine the AnnotParam and BamParam into an RnaSeqParam object and call the simpleRNASeq function, summarizing the reads by exons.
The BAM files are:
dir(extdata(),
    pattern="^[A,T].*\\.bam$",
    full.names=TRUE)
The GFF3 file is:
system.file(
"extdata",
"Dmel-mRNA-exon-r5.52.gff3",
package="RnaSeqTutorial")
And now we can proceed:
## we start by loading the packages
library("easyRNASeq")
library("RnaSeqTutorial")
## looking up the function definition
show(simpleRNASeq)
## creating the BamFileList
bamFileList <- getBamFileList(
dir(path=system.file("extdata",
package="RnaSeqTutorial"),
pattern="^[A,T].*\\.bam$",
full.names=TRUE))
## creating the AnnotParam object
annotParam <- AnnotParam(datasource=system.file(
"extdata",
"Dmel-mRNA-exon-r5.52.gff3",
package="RnaSeqTutorial"))
## creating the BamParam object
bamParam <- BamParam(paired=FALSE,stranded=FALSE)
## creating the RnaSeqParam
rnaSeqParam <- RnaSeqParam(annotParam=annotParam,
bamParam=bamParam,
countBy="exons")
## and we then get a SummarizedExperiment containing the counts table
sexp <- simpleRNASeq(
bamFiles=bamFileList,
param=rnaSeqParam,
verbose=TRUE
)
## and look at the counts
exon.count <- assays(sexp)$exons
head(exon.count[rowSums(exon.count)>0,])
And that is it. In a few commands - one, really - you have achieved what took us all day to do!
Try to re-run simpleRNASeq with a minimalistic set of arguments to check its parameter detection abilities.
sexp <- simpleRNASeq(
bamFiles=bamFileList,
param=RnaSeqParam(annotParam=annotParam),
verbose=TRUE
)
The simpleRNASeq function currently accepts the following types of datasources - specified through an AnnotParam object:
“biomaRt” use biomaRt to retrieve the annotation
“env” use a RangedData or GRanges class object present in the environment
“gff” reads in a gff version 3 file
“gtf” reads in a gtf file
“rda” load an RData object. The object needs to be named and of class RangedData or GRanges.
The reads can be summarized by:
exons
features (any features such as introns, enhancers, etc.)
transcripts
genes - ideally, these are the result of a process similar to the one from the lecture, where a gene is represented by the set of non-overlapping loci (i.e. synthetic exons) representing all the possible exons and UTRs of that gene. Such "synthetic transcripts", as I call them, are essential when counting reads, as they ensure that no read is counted several times. E.g. a gene can have different isoforms, using different, overlapping exons, in which case summarizing by exons might result in counting a read several times, once per overlapping exon. See section 7.1 of the easyRNASeq vignette for details.
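As an illustration of how such synthetic exons can be derived, here is a minimal sketch - with toy coordinates and a hypothetical gene ID, not the tutorial data - using the GenomicRanges reduce function to collapse the overlapping exons of a gene into non-overlapping loci:

```r
## a minimal sketch, assuming toy coordinates for a hypothetical "geneA"
suppressPackageStartupMessages(library(GenomicRanges))

## two isoforms with overlapping exons
exons <- GRanges(seqnames="Chr19",
                 ranges=IRanges(start=c(100, 250, 100, 300),
                                end  =c(200, 400, 180, 450)),
                 strand="+",
                 gene=rep("geneA", 4))

## collapse the exons per gene into non-overlapping "synthetic exons"
synthetic.exons <- reduce(split(exons, exons$gene))
synthetic.exons
## a read overlapping positions 150-180 now matches a single locus
## instead of the two overlapping exons it overlapped originally
```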
The results are returned in a SummarizedExperiment class object.
Have a closer look at the SummarizedExperiment object.
See the GenomicRanges package's SummarizedExperiment class for more details on the last three accessors used in the following.
## the counts
head(assays(sexp)[[1]])
## some non empty counts
head(assays(sexp)[[1]][rowSums(assays(sexp)[[1]])!=0,])
## the sample info
colData(sexp)
## the 'features' info
rowRanges(sexp)
For more details and a complete overview of the easyRNASeq package capabilities, have a look at the easyRNASeq vignette.
vignette("easyRNASeq")
easyRNASeq is still under active development and, as such, the current version may still behave unexpectedly.
In addition, advanced checks are conducted on the data provided by the user to make decisions on the overall process; these may need refinement.
The concerns raised by the analysis reported at https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html by Robinson et al. have been addressed. Both the original easyRNASeq method and the GenomicAlignments approach are provided, the latter being the default.
Integration of recent tools for Differential Expression analysis such as DESeq2 and DEXSeq is pending. An integration of limma is planned, to enable the voom+limma paradigm. Ideally, easyRNASeq would select the most appropriate analysis to conduct based on the report by Soneson and Delorenzi (Soneson and Delorenzi 2013).
After obtaining the count table, numerous downstream analyses are available. Most often, such count tables are generated in a differential expression experimental setup. In that case, packages such as DESeq, DEXSeq, edgeR and limma (see voom+limma in the limma vignette) are some of the possibilities available in Bioconductor. Have a look at (Dillies et al. 2012) and (Soneson and Delorenzi 2013) to decide which tool/approach is best suited to your experimental design. But, of course, counts can just as well be used for other purposes, such as visualization using e.g. the rtracklayer and GViz packages. Actually, there's no real limitation to what one can achieve with a count table, and it need not come from an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data for differential binding analyses.
library(RnaSeqTutorial)
As a running example, we use a dataset derived from a study performed in Populus tremula (Robinson, Delhomme et al., BMC Plant Biology, 2014). The authors test the evolutionary theory suggesting that males and females may evolve sexually dimorphic phenotypic and biochemical traits, concordant with each sex having different optimal strategies of resource investment to maximise reproductive success and fitness. Such sexual dimorphism should result in sex-biased gene expression patterns in non-floral organs for autosomal genes associated with the control and development of such phenotypic traits. Hence, the authors, among other approaches, have looked for gene expression differences between sexes. This was achieved by an RNA-Seq differential expression analysis between samples grouped by sex. The samples were sequenced on an Illumina HiSeq 2000, using TruSeq adapters, through a 101-cycle paired-end protocol, which yielded between 11 million and 34 million reads per sample.
First, we read the HTSeq files into a matrix. The DESeq2 package actually now has a function to ease that process: DESeqDataSetFromHTSeqCount. Here, we just process the samples in parallel using mclapply instead.
res <- mclapply(dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt",
full.names=TRUE),function(fil){
read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
},mc.cores=2)
names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt"))
Then we extract the additional info that HTSeq writes at the end of every file, detailing how many reads aligned uniquely or not, had no or ambiguous overlaps, or were of too low a quality. That last category should not occur at this stage of the analysis, as we have sorted and trimmed the data before aligning.
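For reference, the tail of an htseq-count output file looks like the following - two tab-separated columns, with the special counters prefixed by a double underscore (the counts shown here are made up for illustration):

```
Potra000503g03273.0	125
__no_feature	1203457
__ambiguous	571834
__too_low_aQual	0
__not_aligned	0
__alignment_not_unique	498211
```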
addInfo <- c("__no_feature","__ambiguous",
"__too_low_aQual","__not_aligned",
"__alignment_not_unique")
And we then extract the counts
sel <- match(addInfo,res[[1]][,1])
count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
colnames(count.table) <- names(res)
rownames(count.table) <- res[[1]][,1][-sel]
Here we aggregate the information about how many reads aligned, how many were ambiguous, etc.
count.stats <- do.call(cbind,lapply(res,"[",2))[sel,]
colnames(count.stats) <- names(res)
rownames(count.stats) <- sub("__","",addInfo)
count.stats <- rbind(count.stats,aligned=colSums(count.table))
count.stats <- count.stats[rowSums(count.stats) > 0,]
Then we convert them to percentages and check them all
apply(count.stats,2,function(co){round(co*100/sum(co))})
## 202 207 213.1 221 226.1 229.1 229 235 236 239 244 303
## no_feature 7 8 7 7 7 7 8 7 7 7 8 7
## ambiguous 4 4 4 4 4 4 4 4 4 3 4 4
## alignment_not_unique 3 3 3 3 3 3 3 3 3 3 3 3
## aligned 86 85 86 87 86 86 86 86 86 86 86 86
## 305.3 309.1 310.3 337.1 349.2
## no_feature 7 7 7 7 8
## ambiguous 4 4 4 4 4
## alignment_not_unique 3 3 3 3 3
## aligned 87 86 86 86 85
As can be seen, an average of 86% of the reads aligned uniquely and unambiguously to features. About 3% mapped to multiple locations, 4% overlapped ambiguous features (as the data is not strand-specific, we used the "Union" counting scheme) and, finally, 7% aligned to no feature at all. This is not at all unexpected, given that the alignments were done against the draft of the P. tremula genome.
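The counting scheme is selected through htseq-count's -m option. As a reminder, the kind of command that would produce such files looks like the following - a sketch only, with hypothetical file names; the -i Parent setting assumes a GFF3 file where exons point to their parent feature:

```bash
# hypothetical file names; "-m union" selects the Union counting
# scheme, "-s no" declares the data as non strand-specific
htseq-count -f bam -m union -s no -t exon -i Parent \
    202_sorted_STAR.bam annotation.gff3 > 202_STAR.txt
```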
pal=brewer.pal(6,"Dark2")[1:nrow(count.stats)]
mar <- par("mar")
par(mar=c(7.1,5.1,4.1,2.1))
barplot(as.matrix(count.stats),col=pal,beside=TRUE,las=2,main="read proportion",
ylim=range(count.stats)+c(0,1e+7))
legend("top",fill=pal,legend=rownames(count.stats),bty="n",cex=0.8,horiz=TRUE)
par(mar=mar)
Here, one can clearly see a library size effect; the samples 229.1 and 349.2 have been sequenced deeper, as these are the parents of a large-scale F1 population established for another study. Apart from these, the amount of reads per library seems fairly equally distributed.
We can estimate how many of the genes are not expressed
sel <- rowSums(count.table) == 0
sprintf("%s percent",round(sum(sel) * 100/ nrow(count.table),digits=1))
## [1] "13.1 percent"
sprintf("of %s genes are not expressed",nrow(count.table))
## [1] "of 35309 genes are not expressed"
So 13.1% of a total of 35309 genes are not expressed
To assess the validity of the replicates, we can look at different metrics.
We start by looking at the per-gene mean expression, i.e. the mean raw count of every gene across samples is calculated and displayed on a log10 scale
plot(density(log10(rowMeans(count.table))),col=pal[1],
main="mean raw counts distribution",
xlab="mean raw counts (log10)")
Then the same is done for the individual samples colored by sample type
pal=brewer.pal(8,"Dark2")
plot.multidensity(log10(count.table),col=rep(pal,each=3),
legend.x="topright",legend.cex=0.5,
main="sample raw counts distribution",
xlab="per gene raw counts (log10)")
## NULL
As one can see, most samples show similar trends; a few samples are shifted to the right - those that were sequenced deeper - but this does not affect the global shape of the curve.
For visualization, the data is submitted to a variance stabilizing transformation using DESeq2. The dispersion is estimated independently of the sample type. We nonetheless define the metadata of importance, i.e. every sample's sex and year of sampling
samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
conditions <- names(res)
sex <- samples$sex[match(names(res),samples$sample)]
date <- factor(samples$date[match(names(res),samples$sample)])
And then we create the DESeq2 object, using the sample names as conditions (which are hence irrelevant to any comparison)
dds <- DESeqDataSetFromMatrix(
countData = count.table,
colData = data.frame(condition=conditions,
sex=sex,
date=date),
design = ~ condition)
Once this is done, we first check the size factors and, as there is no big variation, we decide to go for a variance stabilizing transformation over a regularized log transformation. Check the DESeq2 vignette for more details about this.
dds <- estimateSizeFactors(dds)
sizes <- sizeFactors(dds)
names(sizes) <- names(res)
sizes
## 202 207 213.1 221 226.1 229.1 229 235 236 239 244 303
## 0.79 1.06 0.88 0.81 0.91 1.89 0.93 1.51 1.09 1.06 0.90 0.99
## 305.3 309.1 310.3 337.1 349.2
## 0.65 0.71 1.10 0.80 1.94
boxplot(sizes,main="relative library sizes",ylab="scaling factor")
Let us do the vst
colData(dds)$condition <- factor(colData(dds)$condition,
levels=unique(conditions))
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
vst <- assay(vsd)
colnames(vst) <- colnames(count.table)
The vst introduces an offset, i.e. non-expressed genes will have a value set to the minimal value of the vst. To avoid lengthy, unnecessary explanations, we simply shift all values so that non-expressed genes have a vst value of 0
vst <- vst - min(vst)
It is then essential to validate the vst effect and ensure its validity. To that extent, we visualize the corrected mean-sd relationship. As it is fairly linear, we can assume homoscedasticity. The slight initial trend/bump is due to genes having few counts in a few subsets of the samples and hence a higher variability; this is expected.
meanSdPlot(assay(vsd)[rowSums(count.table)>0,])
Here is what would be observed with a log2 transformation of the raw count data
meanSdPlot(log2(as.matrix(count.table[rowSums(count.table)>0,])))
## Warning: Removed 8894 rows containing non-finite values (stat_binhex).
or for the counts adjusted for the library size factors
meanSdPlot(log2(counts(dds,normalized=TRUE)[rowSums(count.table)>0,]))
## Warning: Removed 8894 rows containing non-finite values (stat_binhex).
We perform a Principal Component Analysis (PCA) of the data as a quick quality assessment, i.e. replicates should cluster together and the first dimensions should be explainable by biological means.
pc <- prcomp(t(vst))
percent <- round(summary(pc)$importance[2,]*100)
smpls <- conditions
We color the samples by date and sex
sex.cols<-c("pink","lightblue")
sex.names<-c(F="Female",M="Male")
And then check whether the observed separation is due to the samples' sex
scatterplot3d(pc$x[,1],
pc$x[,2],
pc$x[,3],
xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
color=sex.cols[as.integer(factor(sex))],
pch=19)
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
legend=c("Color:",sex.names[levels(factor(sex))]))
par(mar=mar)
This does not seem to be the case at all. There are 2 clusters of points, but both contain pink and blue dots equally. So we next color by sampling date
dat <- date
scatterplot3d(pc$x[,1],
pc$x[,2],
pc$x[,3],
xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
color=pal[as.integer(factor(dat))],
pch=19)
legend("topleft",pch=rep(c(19,23),each=10),col=rep(pal,2),legend=levels(factor(dat)),bty="n")
par(mar=mar)
And here we are… The sampling date is a STRONG CONFOUNDING FACTOR in our analysis. So let's do a final plot with color = sex and shape = date
scatterplot3d(pc$x[,1],
pc$x[,2],
pc$x[,3],
xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
zlab=paste("Comp. 3 (",percent[3],"%)",sep=""),
color=sex.cols[as.integer(factor(sex))],
pch=c(19,17)[as.integer(factor(dat))])
legend("bottomright",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topleft",pch=c(NA,21,24),col=c(NA,1,1),
legend=c("Symbol:",levels(factor(dat))),cex=0.85)
par(mar=mar)
Sometimes 3D PCA plots are not so easy to read and we may be just as well off looking at 2 dimensions at a time. First, the first 2 dimensions
plot(pc$x[,1],
pc$x[,2],
xlab=paste("Comp. 1 (",percent[1],"%)",sep=""),
ylab=paste("Comp. 2 (",percent[2],"%)",sep=""),
col=sex.cols[as.integer(factor(sex))],
pch=c(19,17)[as.integer(factor(dat))],
main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
legend=c("Symbol:",levels(factor(dat))),cex=0.85)
text(pc$x[,1],
pc$x[,2],
labels=conditions,cex=.5,adj=-1)
Clearly, the sampling date separates most samples. It is interesting to see that the 2 samples sequenced deeper lie between the 2 clusters. We could hypothesize that the environmental changes between sampling years affect less important genes, i.e. genes whose expression levels are lower; we would then somewhat be looking at transcriptional "noise". Since the 2 parent samples have been sequenced much deeper, the fact that they group with neither cluster might be due to them, too, having an increased proportion of transcriptional noise. Looking at the 2nd and 3rd dimensions does not reveal any obvious effect
plot(pc$x[,2],
pc$x[,3],
xlab=paste("Comp. 2 (",percent[2],"%)",sep=""),
ylab=paste("Comp. 3 (",percent[3],"%)",sep=""),
col=sex.cols[as.integer(factor(sex))],
pch=c(19,17)[as.integer(factor(dat))],
main="Principal Component Analysis",sub="variance stabilized counts")
legend("bottomleft",pch=c(NA,15,15),col=c(NA,sex.cols[1:2]),
legend=c("Color:",sex.names[levels(factor(sex))]))
legend("topright",pch=c(NA,21,24),col=c(NA,1,1),
legend=c("Symbol:",levels(factor(dat))),cex=0.85)
Another way to look at the data is to create a heatmap. As heatmaps are computationally demanding and hard to read when there is too much data, we only look at the 1000 most variable genes
sel <- order(apply(vst,1,sd),decreasing=TRUE)[1:1000]
First, with the sample IDs
heatmap.2(vst[sel,],labRow = NA,trace = "none",cexCol = 0.6 )
Then the sex
heatmap.2(vst[sel,],labRow = NA, labCol = sex,trace = "none",cexCol = 0.6 )
And the date
heatmap.2(vst[sel,],labRow = NA, labCol = date,trace = "none",cexCol = 0.6 )
library(RnaSeqTutorial)
samples <- read.csv(file.path(extdata(),"sex-samples.csv"))
res <- mclapply(dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt",
full.names=TRUE),function(fil){
read.delim(fil,header=FALSE,stringsAsFactors=FALSE)
},mc.cores=2)
names(res) <- gsub("_.*_STAR\\.txt","",dir(file.path(extdata(),"htseq"),
pattern="^[2,3].*_STAR\\.txt"))
addInfo <- c("__no_feature","__ambiguous",
"__too_low_aQual","__not_aligned",
"__alignment_not_unique")
sel <- match(addInfo,res[[1]][,1])
count.table <- do.call(cbind,lapply(res,"[",2))[-sel,]
colnames(count.table) <- names(res)
rownames(count.table) <- res[[1]][,1][-sel]
conditions <- names(res)
sex <- samples$sex[match(names(res),samples$sample)]
date <- factor(samples$date[match(names(res),samples$sample)])
pal=brewer.pal(8,"Dark2")
The QA revealed that we have a confounding factor. Luckily enough, the grouping of sexes is balanced between sampling years, and hence we will be able to block the year effect to investigate the sex effect.
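To see what blocking means in terms of the model, here is a small base-R sketch - using made-up sample metadata, not the tutorial samples - showing the design matrix underlying a ~ date + sex model: the date coefficient absorbs the year effect, so the sex coefficient of interest is estimated within years:

```r
## toy metadata mimicking a balanced two-year, two-sex design
meta <- data.frame(date=factor(c(2008, 2008, 2010, 2010)),
                   sex =factor(c("F", "M", "F", "M")))

## the design matrix used by the GLM: an intercept, a date
## (blocking) coefficient and the sex coefficient of interest
model.matrix(~ date + sex, data=meta)
```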
yearSexDesign <- samples[grep("[2,3].*",samples$sample),c("sex","date")]
The primary reason to use DESeq over DESeq2, edgeR or any other package is that DESeq is very conservative. The second reason is that DESeq gives less weight to outliers than DESeq2 does; based on the study by Pakull et al., which identified a candidate gene for sexual determination - the same one we simultaneously and independently identified in our analyses - it seems that our sample 226.1 may be mis-classified. However, the sex phenotyping has been done thoroughly, so it may also be that sex determination is a more complex trait and that Potri.019G047300 is only one factor influencing it. As introduced, we want to look at the sex effect while blocking the year effect. We start by estimating the size factors and the dispersion
cdsFull <- newCountDataSet(count.table,yearSexDesign)
cdsFull = estimateSizeFactors( cdsFull )
cdsFull = estimateDispersions( cdsFull )
We can now check the dispersion estimated by DESeq, which is, obviously, conservative. I.e. most dispersion values lie below the actual dispersion fit by about 1 fold.
plotDispLSD(cdsFull)
Next, we create both models (one considering the date only, and one the date and sex)
fit1 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date + sex ))
## ...............................
fit0 = suppressWarnings(fitNbinomGLMs( cdsFull, count ~ date))
## ...............................
For the rest of the analysis, we ignore the genes that did not converge in the previous step
sel <- !is.na(fit1$converged) & !is.na(fit0$converged) & fit1$converged & fit0$converged
fit1 <- fit1[sel,]
fit0 <- fit0[sel,]
We next calculate the GLM p-values and adjust them for multiple testing
pvalsGLM = nbinomGLMTest( fit1, fit0 )
padjGLM = p.adjust( pvalsGLM, method="BH" )
We finally visualize the obtained results. A first look shows that a number of genes have very high fold changes, which is due to the fact that these genes have very few reads in very few samples. For clarity, in the following plots, we filter out those genes with too high a FC.
boxplot(rowSums(count.table[rownames(fit1[fit1$sexM > 20,]),]>0),
ylab="number of sample with expression > 0",
main="distribution of the # of samples with\nexpression > 0 for genes with extreme high log2FC"
)
boxplot(rowSums(count.table[rownames(fit1[fit1$sexM < -20,]),]>0),
ylab="number of sample with expression > 0",
main="distribution of the # of samples with\nexpression > 0 for genes with extreme low log2FC"
)
The vast majority of these genes are anyway not significant.
plot(density(padjGLM[fit1$sexM > 20]),
main="Adj. p-value for genes with extreme log2FC",
xlab="Adj. p-value")
plot(density(padjGLM[fit1$sexM < -20]),
main="Adj. p-value for genes with extreme log2FC",
xlab="Adj. p-value")
So we filter them out
sel <- fit1$sexM > -20 & fit1$sexM < 20
fit1 <- fit1[sel,]
padjGLM <- padjGLM[sel]
pvalsGLM <- pvalsGLM[sel]
A further look into the data reveals that some p-values are equal to 0. As this is inadequate for plotting log odds (-log10 of the p-value), we set these 0s to be 1 log10 fold smaller than the otherwise smallest p-value (for plotting purposes only!)
padjGLM[padjGLM==0] <- min(padjGLM[padjGLM!=0])/10
pvalsGLM[pvalsGLM==0] <- min(pvalsGLM[pvalsGLM!=0])/10
As a volcano plot with the non-adjusted p-values. The red circles are the 4 most significantly differentially expressed genes, and the 8 with the absolute highest log2 fold changes (4 positive and 4 negative)
heatscatter(fit1$sexM,-log10(pvalsGLM),
main="Male vs. Female Differential Expression",cor=FALSE,
xlab="Log2 Fold Change", ylab="- log(10) p-value")
legend("topleft",bty="n","1% p-value cutoff",lty=2,col="gray")
points(fit1[pvalsGLM<.01,"sexM"],
-log10(pvalsGLM[pvalsGLM<.01]),
col="lightblue",pch=19)
pos <- c(order(pvalsGLM)[1:4],
order(fit1$sexM,decreasing=TRUE)[1:4],
order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(pvalsGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")
As a volcano plot with adjusted p-values
heatscatter(fit1$sexM,-log10(padjGLM),
main="Male vs. Female Differential Expression",cor=FALSE,
xlab="Log2 Fold Change", ylab="- log(10) adj. p-value")
legend("topleft",bty="n","1% FDR cutoff",lty=2,col="gray")
points(fit1[padjGLM<.01,"sexM"],-log10(padjGLM[padjGLM<.01]),col="lightblue",pch=19)
pos <- c(order(padjGLM)[1:4],order(fit1$sexM,decreasing=TRUE)[1:4],order(fit1$sexM)[1:4])
points(fit1[pos,"sexM"],-log10(padjGLM[pos]),col="red")
abline(h=2,lty=2,col="gray")
As DESeq is an "older" implementation of this type of analysis, we will redo the analysis using DESeq2. To ultimately be able to compare the obtained sets of differential expression candidate genes, we save the list of genes significant at a 1% adjusted p-value cutoff.
UmAspD <- rownames(fit1[padjGLM<.01,])
which is not many (just one)
UmAspD
## [1] "Potra000503g03273.0"
We redo the analyses performed above. As you will see, DESeq2 has been made more intuitive and the same can be achieved in fewer commands.
dds <- DESeqDataSetFromMatrix(
countData = count.table,
colData = data.frame(condition=conditions,
sex=sex,
date=date),
design = ~date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)
res <- results(dds,contrast=c("sex","M","F"))
In 4 commands, we have reproduced the analysis and observed that the dispersion estimation looks different. DESeq2 introduces so-called "shrinkage", which you can learn more about in the DESeq2 vignettes. We next extract the results using the same 1% cutoff and plot similar validation plots
alpha=0.01
plotMA(res,alpha=alpha)
volcanoPlot(res,alpha=alpha)
## NULL
The volcano plot looks similar to what we observed previously (except that we have, by mistake, inverted the ratio calculation; we now did F-M), although slightly less dense than in DESeq. We also do not get any genes with a p-value of 0
hist(res$padj,breaks=seq(0,1,.01))
sum(res$padj<alpha,na.rm=TRUE)
## [1] 3
3 genes are candidates for differential expression at a 1% adjusted p-value cutoff, which we also save for later comparison
UmAspD2 <- rownames(res[order(res$padj,na.last=TRUE),][1:sum(res$padj<alpha,na.rm=TRUE),])
UmAspD2
## [1] "Potra003910g23469.0" "Potra001414g11999.0" "Potra004533g25038.0"
We can already observe that there is no overlap with the previous list: the chromosome 19 candidate has disappeared. It is surprising, given the Pakull et al. publication, that Potri.019G047300.0 (the homolog of Potra000503g03273.0) no longer comes out as a differentially expressed candidate gene at a 1% FDR cutoff with DESeq2. But its adjusted p-value is 2.1% - see below. Moreover, its Cook's distance - a measure of the homogeneity of the gene expression within a condition - is relatively high (2x the average) and might indicate the presence of an outlier. Looking in more detail at the count values and sex assignment shows that the sample 226.1 behaves like a male sample with regard to that gene, so either the sample has been mis-sexed, or the Pakull et al. results in P. tremuloides and P. tremula are only a partial view of the true biology.
as.data.frame(res)["Potra000503g03273.0",]
## baseMean log2FoldChange lfcSE stat pvalue padj
## Potra000503g03273.0 125 1.1 0.24 4.6 3.9e-06 0.02
data.frame(
counts=counts(dds,normalized=TRUE)["Potra000503g03273.0",],
sex,
date,
conditions)
## counts sex date conditions
## 202 7.64 F 2008 202
## 207 145.27 M 2010 207
## 213.1 0.00 F 2008 213.1
## 221 261.99 M 2008 221
## 226.1 23.02 F 2010 226.1
## 229.1 305.21 M 2008 229.1
## 229 79.92 M 2010 229
## 235 91.24 M 2010 235
## 236 0.92 F 2010 236
## 239 0.00 F 2010 239
## 244 2.22 F 2010 244
## 303 476.78 M 2008 303
## 305.3 87.38 F 2008 305.3
## 309.1 237.93 M 2008 309.1
## 310.3 49.12 F 2008 310.3
## 337.1 349.03 M 2008 337.1
## 349.2 5.15 F 2008 349.2
mcols(dds)[names(rowRanges(dds))=="Potra000503g03273.0",]
## DataFrame with 1 row and 36 columns
## baseMean baseVar allZero dispGeneEst dispFit dispersion dispIter
## <numeric> <numeric> <logical> <numeric> <numeric> <numeric> <numeric>
## 1 125 21809 FALSE 1.3 0.13 0.99 10
## dispOutlier dispMAP Intercept date2008 date2010 sexF sexM
## <logical> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 FALSE 0.99 6.2 0.64 -0.64 -0.56 0.56
## SE_Intercept SE_date2008 SE_date2010 SE_sexF SE_sexM MLE_Intercept
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 0.35 0.25 0.25 0.12 0.12 4.8
## MLE_date_2010_vs_2008 MLE_sex_M_vs_F WaldStatistic_Intercept
## <numeric> <numeric> <numeric>
## 1 -1.9 3.7 18
## WaldStatistic_date2008 WaldStatistic_date2010 WaldStatistic_sexF
## <numeric> <numeric> <numeric>
## 1 2.6 -2.6 -4.6
## WaldStatistic_sexM WaldPvalue_Intercept WaldPvalue_date2008
## <numeric> <numeric> <numeric>
## 1 4.6 4.7e-69 0.0099
## WaldPvalue_date2010 WaldPvalue_sexF WaldPvalue_sexM betaConv betaIter
## <numeric> <numeric> <numeric> <logical> <numeric>
## 1 0.0099 3.9e-06 3.9e-06 TRUE 18
## deviance maxCooks
## <numeric> <numeric>
## 1 183 0.77
Nevertheless, while waiting for the next time the tree flowers, when we can confirm its sex with certainty, we can assume that the sample was mis-sexed and see what effect a sex correction would have. Given the data from Pakull et al. about that gene, and assuming that the sample 226.1 was mis-sexed, we redo the DESeq2 analysis after swapping the sex of that sample
sex[conditions=="226.1"] <- "M"
dds <- DESeqDataSetFromMatrix(
countData = count.table,
colData = data.frame(condition=conditions,
sex=sex,
date=date),
design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res.swap <- results(dds,contrast=c("sex","M","F"))
plotMA(res.swap,alpha=alpha)
volcanoPlot(res.swap,alpha=alpha)
## NULL
hist(res.swap$padj,breaks=seq(0,1,.01))
sum(res.swap$padj<alpha,na.rm=TRUE)
## [1] 7
UmAspD2.swap <- rownames(res.swap[order(res.swap$padj,na.last=TRUE),][1:sum(res.swap$padj<alpha,na.rm=TRUE),])
UmAspD2.swap
## [1] "Potra000503g03273.0" "Potra001218g10512.0" "Potra001519g12651.0"
## [4] "Potra003910g23469.0" "Potra005027g34078.0" "Potra181824g28189.0"
## [7] "Potra002176g16783.0"
Fair enough, the gene made it back into the list.
But still, to assess the importance of the 226.1 sample, we redo the DESeq2 DE analysis without it (since we have enough replicates for that removal not to affect the power of our analysis)
sel <- conditions != "226.1"
dds <- DESeqDataSetFromMatrix(
countData = count.table[,sel],
colData = data.frame(condition=conditions[sel],
sex=sex[sel],
date=date[sel]),
design = ~ date+sex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
plotDispEsts(dds)
res.sel <- results(dds,contrast=c("sex","M","F"))
plotMA(res.sel,alpha=alpha)
volcanoPlot(res.sel,alpha=alpha)
## NULL
hist(res.sel$padj,breaks=seq(0,1,.01))
sum(res.sel$padj<alpha,na.rm=TRUE)
## [1] 8
UmAspD2.sel <- rownames(res.sel[order(res.sel$padj,na.last=TRUE),][1:sum(res.sel$padj<alpha,na.rm=TRUE),])
UmAspD2.sel
## [1] "Potra003910g23469.0" "Potra005027g34078.0" "Potra000503g03273.0"
## [4] "Potra001414g11999.0" "Potra001519g12651.0" "Potra004533g25038.0"
## [7] "Potra001218g10512.0" "Potra178727g27972.0"
We compare the results of the 4 approaches, the easiest way being to plot a Venn diagram
plot.new()
grid.draw(venn.diagram(list(
UmAsp=UmAspD2,
UmAsp.swap=UmAspD2.swap,
UmAsp.sel=UmAspD2.sel,
UmAspD = UmAspD),
filename=NULL,
col=pal[1:4],
category.names=c("UmAsp - DESeq2",
"UmAsp - corrected",
"UmAsp - removed",
"UmAsp - DESeq")))
Unlike in the original analysis conducted in the study - where the Populus trichocarpa genome was used as a reference (the P. tremula genome assembly is still a draft), the gene Potri.014G155300.0 was consistently found by all 4 approaches, and the gene Potri.019G047300.0 (the Potra000503g03273.0 homolog) was found by both DESeq and DESeq2 - there is no overlap here between the two native methods. This is probably because the results in P. tremula are obscured by an assembly artefact: we know that the two genes in P. trichocarpa have a very high sequence similarity (being putatively very recent paralogs), which appears to have been collapsed into a single locus in P. tremula. We can also note that "messing" with the sample sex attribute, or removing that sample, leads to the identification of several more genes! It is unclear whether these are the result of a technical error or whether they would be worth investigating further.
sort(intersect(UmAspD2,UmAspD))
## character(0)
sort(intersect(UmAspD2.swap,UmAspD))
## [1] "Potra000503g03273.0"
sort(intersect(UmAspD2.sel,UmAspD))
## [1] "Potra000503g03273.0"
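When comparing more than a couple of gene lists, the pairwise overlaps can also be tabulated in one go rather than calling intersect repeatedly. A minimal base-R sketch (the helper function and the toy gene IDs are hypothetical, not part of the analysis above):

```r
## count the pairwise intersections among a named list of gene-ID vectors
overlap.matrix <- function(sets) {
  n <- length(sets)
  m <- matrix(0L, n, n, dimnames=list(names(sets), names(sets)))
  for (i in seq_len(n))
    for (j in seq_len(n))
      m[i, j] <- length(intersect(sets[[i]], sets[[j]]))
  m
}

## toy example with hypothetical IDs; the diagonal holds the set sizes
overlap.matrix(list(A=c("g1","g2","g3"), B=c("g2","g3","g4")))
```

The same call applied to the four gene lists above would summarize the Venn diagram numerically.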
Analyses performed when asking Mike Love (the DESeq2 developer) why DESeq2 seems so sensitive to misclassification. The short answer was to use Cook's distance to investigate that further, and that the 1% FDR cutoff was conservative. These are commented out on purpose. Note that the rowData function has been replaced by rowRanges in Bioconductor version 3.1 and above.
## mis-classified
# sex <- samples$sex[match(colnames(count.table),samples$sample)]
# date <- factor(samples$date[match(colnames(count.table),samples$sample)])
# ddsM <- DESeqDataSetFromMatrix(
# countData = count.table,
# colData = data.frame(condition=conditions,
# sex=sex,
# date=date),
# design = ~ date+sex)
# ddsM <- DESeq(ddsM)
# resM <- results(ddsM,contrast=c("sex","F","M"))
#
# counts(ddsM,normalized=TRUE)["Potri.019G047300.0",]
# resM["Potri.019G047300.0",]
# mcols(ddsM)[names(rowData(ddsM)) == "Potri.019G047300.0",]
#
# ## reclassified
# sexR <- sex
# sexR[5] <- "M"
# ddsR <- DESeqDataSetFromMatrix(
# countData = count.table,
# colData = data.frame(condition=conditions,
# sex=sexR,
# date=date),
# design = ~ date+sex)
# ddsR <- DESeq(ddsR)
# resR <- results(ddsR,contrast=c("sex","F","M"))
#
# counts(ddsR,normalized=TRUE)["Potri.019G047300.0",]
# resR["Potri.019G047300.0",]
# mcols(ddsR)[names(rowData(ddsR)) == "Potri.019G047300.0",]
#
# ## samples details
# sex
# date
#
# ## the model is not affected by the reclassification
# lapply(split(sex,date),table)
# lapply(split(sexR,date),table)
#
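Following up on that suggestion, the Cook's distances that DESeq2 computes during fitting can be retrieved directly from the fitted object. A minimal sketch, assuming the fitted `dds` object from above; the 0.99 F-quantile is the default outlier cutoff DESeq2 documents for its independent filtering of outliers:

```r
## Cook's distances are stored as an assay of the fitted object; a large
## value flags a sample with an outsized influence on that gene's fit
cooks <- assays(dds)[["cooks"]]

## default outlier cutoff: the 99% quantile of an F(p, m-p) distribution,
## with m samples and p model coefficients
m <- ncol(dds)
p <- length(resultsNames(dds))
cutoff <- qf(.99, p, m - p)

## genes with at least one outlying sample
head(rownames(dds)[apply(cooks, 1, max, na.rm=TRUE) > cutoff])
```

Genes flagged this way have their p-values set to NA by results() unless outlier replacement is enabled, which is one reason to inspect the distances before tightening or relaxing the FDR cutoff.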
This concludes this differential expression analysis. The following details the R version and packages that were used to perform the analyses.
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12 (Sierra)
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] grid stats4 parallel methods stats graphics grDevices
## [8] utils datasets base
##
## other attached packages:
## [1] hexbin_1.27.1
## [2] RnaSeqTutorial_0.99.4
## [3] VennDiagram_1.6.17
## [4] futile.logger_1.4.3
## [5] vsn_3.42.3
## [6] scatterplot3d_0.3-37
## [7] RColorBrewer_1.1-2
## [8] LSD_3.0
## [9] gplots_3.0.1
## [10] DESeq2_1.14.0
## [11] DESeq_1.26.0
## [12] locfit_1.5-9.1
## [13] BSgenome.Dmelanogaster.UCSC.dm3_1.4.0
## [14] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
## [15] ShortRead_1.32.0
## [16] BiocParallel_1.8.1
## [17] seqLogo_1.40.0
## [18] leeBamViews_1.10.0
## [19] BSgenome_1.42.0
## [20] rtracklayer_1.34.0
## [21] GenomicFeatures_1.26.0
## [22] AnnotationDbi_1.36.0
## [23] GenomicAlignments_1.10.0
## [24] Rsamtools_1.26.1
## [25] Biostrings_2.42.0
## [26] XVector_0.14.0
## [27] SummarizedExperiment_1.4.0
## [28] Biobase_2.34.0
## [29] GenomicRanges_1.26.1
## [30] GenomeInfoDb_1.10.0
## [31] IRanges_2.8.0
## [32] S4Vectors_0.12.0
## [33] genomeIntervals_1.30.0
## [34] intervals_0.15.1
## [35] easyRNASeq_2.10.0
## [36] BiocGenerics_0.20.0
## [37] lattice_0.20-34
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 tools_3.3.1 affyio_1.44.0
## [4] rpart_4.1-10 KernSmooth_2.23-15 Hmisc_3.17-4
## [7] DBI_0.5-1 colorspace_1.2-7 nnet_7.3-12
## [10] gridExtra_2.2.1 preprocessCore_1.36.0 chron_2.3-47
## [13] formatR_1.4 labeling_0.3 caTools_1.17.1
## [16] scales_0.4.0 genefilter_1.56.0 affy_1.52.0
## [19] stringr_1.1.0 digest_0.6.10 foreign_0.8-67
## [22] rmarkdown_1.1 htmltools_0.3.5 limma_3.30.2
## [25] RSQLite_1.0.0 BiocInstaller_1.24.0 hwriter_1.3.2
## [28] gtools_3.5.0 acepack_1.4.1 RCurl_1.95-4.8
## [31] magrittr_1.5 Formula_1.2-1 Matrix_1.2-7.1
## [34] Rcpp_0.12.7 munsell_0.4.3 stringi_1.1.2
## [37] yaml_2.1.13 edgeR_3.16.1 zlibbioc_1.20.0
## [40] plyr_1.8.4 gdata_2.17.0 splines_3.3.1
## [43] annotate_1.52.0 knitr_1.14 geneplotter_1.52.0
## [46] biomaRt_2.30.0 futile.options_1.0.0 XML_3.98-1.4
## [49] evaluate_0.10 latticeExtra_0.6-28 lambda.r_1.1.9
## [52] data.table_1.9.6 gtable_0.2.0 assertthat_0.1
## [55] ggplot2_2.1.0 xtable_1.8-2 survival_2.40-1
## [58] tibble_1.2 cluster_2.0.5
Thanks to the workshop organizer, Ari Löytinoja.
Thanks to Martin Morgan (R and Bioconductor guru among other things, not to mention head of the Bioconductor core at FHCRC - he never ceases to amaze us) for the original of most of the material in your hands, which we used during the EMBO October 2012 workshop. Time flies.
Thanks to Ângela Gonçalves for the original material of the section on estimating expression.
Thanks to all the lecturers, it is always fun around you.
Finally, thanks to you, the reader - whatever medium you're reading this on - for having made it that far.
Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11: R106. doi:10.1186/gb-2010-11-10-r106.
Anders, Simon, Paul Pyl, and Wolfgang Huber. 2014. “HTSeq–A Python Framework to Work with High-Throughput Sequencing Data.” BioRxiv.
Bolger, Anthony M, Marc Lohse, and Bjoern Usadel. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Bioinformatics 30 (15): 2114–20. doi:10.1093/bioinformatics/btu170.
Bray, Nicolas L, Harold Pimentel, Pall Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-Seq Quantification.” Nat Biotech 34 (5): 525–27. doi:10.1038/nbt.3519.
Burrows, M, and D J Wheeler. 1994. “A Block-Sorting Lossless Data Compression Algorithm.” Technical Report 124. Palo Alto, CA: Digital Equipment Corporation.
Chambers, John M. 2008. Software for Data Analysis: Programming with R. New York: Springer. http://stat.stanford.edu/~jmc4/Rbook/.
Cock, Peter J A, Christopher J Fields, Naohisa Goto, Michael L Heuer, and Peter M Rice. 2009. “The Sanger Fastq File Format for Sequences with Quality Scores, and the Solexa/Illumina Fastq Variants.” Nucleic Acids Research, December. doi:10.1093/nar/gkp1137.
Dalgaard, Peter. 2008. Introductory Statistics with R. 2nd ed. Springer. http://www.biostat.ku.dk/~pd/ISwR.html.
Delhomme, Nicolas, Ismaël Padioleau, Eileen E. Furlong, and Lars M. Steinmetz. 2012. “EasyRNASeq: A Bioconductor Package for Processing RNA-Seq Data.” Bioinformatics, in press. doi:10.1093/bioinformatics/bts477.
Dillies, Marie-Agnès, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier, Marine Jeanmougin, Nicolas Servant, Céline Keime, et al. 2012. “A Comprehensive Evaluation of Normalization Methods for Illumina High-Throughput Rna Sequencing Data Analysis.” Brief Bioinformatics, September. doi:10.1093/bib/bbs046.
Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal Rna-Seq Aligner.” Bioinformatics 29 (1): 15–21. doi:10.1093/bioinformatics/bts635.
Durinck, Steffen, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma, and Wolfgang Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21 (16): 3439–40. doi:10.1093/bioinformatics/bti525.
Gentleman, Robert. 2008. R Programming for Bioinformatics. Computer Science & Data Analysis. Boca Raton, FL: Chapman & Hall/CRC. http://www.bioconductor.org/pub/RBioinf/.
Glaus, Peter, Antti Honkela, and Magnus Rattray. 2012. “Identifying Differentially Expressed Transcripts from RNA-Seq Data with Biological Variation.” Bioinformatics 28 (13): 1721–8. doi:10.1093/bioinformatics/bts260.
Grabherr, Manfred G, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from Rna-Seq Data Without a Reference Genome.” Nat Biotechnol 29 (7): 644–52. doi:10.1038/nbt.1883.
Holmes, Ian, Keith Harris, and Christopher Quince. 2012. “Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics.” PLoS ONE 7 (2): e30126. doi:10.1371/journal.pone.0030126.
Kabacoff, Rob. 2010. R in Action. Manning. http://www.manning.com/kabacoff.
Kasprzyk, Arek. 2011. “BioMart: Driving a Paradigm Change in Biological Data Management.” Database (Oxford) 2011 (January): bar049. doi:10.1093/database/bar049.
Khan, Zia, Joshua S Bloom, Leonid Kruglyak, and Mona Singh. 2009. “A Practical Algorithm for Finding Maximal Exact Matches in Large Sequence Datasets Using Sparse Suffix Arrays.” Bioinformatics 25 (13): 1609–16. doi:10.1093/bioinformatics/btp275.
Kopylova, Evguenia, Laurent Noé, and Hélène Touzet. 2012. “SortMeRNA: Fast and Accurate Filtering of Ribosomal Rnas in Metatranscriptomic Data.” Bioinformatics 28 (24): 3211–7. doi:10.1093/bioinformatics/bts611.
Langmead, B., C. Trapnell, M. Pop, and S. L. Salzberg. 2009. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biol. 10: R25.
Leśniewska, Anna, and Michał J Okoniewski. 2011. “RnaSeqMap: A Bioconductor Package for Rna Sequencing Data Exploration.” BMC Bioinformatics 12 (January): 200. doi:10.1186/1471-2105-12-200.
Li, Bo, Victor Ruotti, Ron M Stewart, James A Thomson, and Colin N Dewey. 2010. “RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty.” Bioinformatics 26 (4): 493–500. doi:10.1093/bioinformatics/btp692.
Li, H., and R. Durbin. 2009. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics 25 (July): 1754–60.
———. 2010. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics 26 (March): 589–95.
Li, Heng, and Nils Homer. 2010. “A Survey of Sequence Alignment Algorithms for Next-Generation Sequencing.” Briefings in Bioinformatics 11 (5): 473–83. doi:10.1093/bib/bbq015.
Matloff, N. 2011. The Art of R Programming. No Starch Pess.
Meyer, Laurence R, Ann S Zweig, Angie S Hinrichs, Donna Karolchik, Robert M Kuhn, Matthew Wong, Cricket A Sloan, et al. 2013. “The Ucsc Genome Browser Database: Extensions and Updates 2013.” Nucleic Acids Res 41 (Database issue): D64–9. doi:10.1093/nar/gks1048.
Mortazavi, Ali, and others. 2008. “Mapping and Quantifying Mammalian Transcriptomes by Rna-Seq.” Nature Methods 5 (7): 621–8.
Robinson, Kathryn, Nicolas Delhomme, Niklas Mähler, Bastian Schiffthaler, Jenny Önskog, Benedicte Albrectsen, Pär Ingvarsson, Torgeir Hvidsten, Stefan Jansson, and Nathaniel Street. 2014. “Populus Tremula (European Aspen) Shows No Evidence of Sexual Dimorphism.” BMC Plant Biology 14 (1): 276–76. doi:10.1186/s12870-014-0276-5.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (January): 139–40.
Schulz, Marcel H, Daniel R Zerbino, Martin Vingron, and Ewan Birney. 2012. “Oases: Robust de Novo Rna-Seq Assembly Across the Dynamic Range of Expression Levels.” Bioinformatics 28 (8): 1086–92. doi:10.1093/bioinformatics/bts094.
Soneson, Charlotte, and Mauro Delorenzi. 2013. “A Comparison of Methods for Differential Expression Analysis of Rna-Seq Data.” BMC Bioinformatics 14: 91. doi:10.1186/1471-2105-14-91.
Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. 2010. “Transcript Assembly and Quantification by Rna-Seq Reveals Unannotated Transcripts and Isoform Switching During Cell Differentiation.” Nat Biotechnol 28 (5): 511–5. doi:10.1038/nbt.1621.
Turro, Ernest, Shu-Yi Su, Ângela Gonçalves, Lachlan J M Coin, Sylvia Richardson, and Alex Lewin. 2011. “Haplotype and Isoform Specific Expression Estimation Using Multi-Mapping Rna-Seq Reads.” Genome Biol 12 (2): R13. doi:10.1186/gb-2011-12-2-r13.
Wu, Thomas D, and Colin K Watanabe. 2005. “GMAP: A Genomic Mapping and Alignment Program for MRNA and Est Sequences.” Bioinformatics 21 (9): 1859–75. doi:10.1093/bioinformatics/bti310.
Do not worry if some of this is illegible to you at present; it should be clear by the end of the workshop if we do our job.
See http://www.epigenesys.eu/images/stories/protocols/pdf/20150303161357_p67.pdf
The authors want to thank Martin Morgan for the original material of the present chapter.
The alignment rate depends on the genome quality and completeness and can hence have a large range - the values presented here are from the Norway spruce, a version 1 draft of the genome.
The values presented here report only uniquely aligning reads. In our example, the rate of non-aligning reads is usually equal to the rate of multi-mapping reads, i.e. about 10% for both in the worst cases.
The authors want to thank Ângela Gonçalves for parts of the present chapter.